KDB - Find duplicates or similar entries in one column - kdb

I'm trying to eliminate duplicate entries for customers in my contact list. Assume my table has three columns (FirstName, LastName, CustomerID).
Can somebody help me create a query that identifies different CustomerIDs with either the same or very similar First and Last Names? We end up with multiple entries due to sales people searching for a name and not finding it due to misspellings. They then create a new entry for the customer with a slightly different spelling of the name.
Thanks!

One approach is to manage a mapping of names to common (mis)spellings and then map all the various spellings back to the intended name. Then group them.
t:([] fn:100?(`John;`Mike;`Bob;`john;`Johnn;`Mick;`Bobby);ln:100?(`Doe;`Smith;`doe;`Do;`smith);id:til 100)
mapFN:exec similar!name from ungroup flip `name`similar!flip (
(`Bob; (`Bob;`bob;`Bobby;`bobby));
(`John; (`John;`Johnn;`john));
(`Mike; (`Mike;`mike;`Mick;`Michael))
);
mapLN:exec similar!name from ungroup flip `name`similar!flip (
(`Doe; (`Doe;`doe;`Do));
(`Smith; (`Smith;`smith;`Smyth))
);
Without mapping:
q)`fn`ln xgroup t
fn ln | id
-----------| ----------------
Mick Do | 0 25 26 50 68 71
Bobby Smith| 1 22 23 83
John Smith| 2 8 48 51 69 85
Mike Doe | 3 44
john doe | ,4
Mick Doe | 5 47 95
John Doe | 6 46 49 63
john Smith| 7 66 74
Johnn doe | 9 13 79 94
Mick doe | 10 20 55 67
Bobby smith| 11 17 18 53
john Doe | 12 21 56
...
With mapping:
q)`fn`ln xgroup update mapFN[fn],mapLN[ln] from t
fn ln | id
----------| -----------------------------------------------------------------
Mike Doe | 0 3 5 10 20 25 26 39 44 47 50 52 55 67 68 70 71 78 95 97
Bob Smith| 1 11 17 18 22 23 30 38 45 53 77 82 83
John Smith| 2 7 8 16 19 33 37 40 43 48 51 64 66 69 73 74 80 85 87
John Doe | 4 6 9 12 13 21 31 32 41 42 46 49 56 57 62 63 65 72 79 81 86 89 91
Bob Doe | 14 24 27 28 35 54 58 59 61 75 76 84
Mike Smith| 15 29 34 36 60 88 90 93 96 98
You could also do something more sophisticated with regex pattern matching.
The mapping would need to be pretty precise though, as otherwise you might end up with false groupings.
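For readers who do not use q, the same idea can be sketched in Python. This is only an illustration of the mapping-then-grouping approach: the tables `FIRST` and `LAST` mirror `mapFN` and `mapLN` above, while the helper names `invert` and `group_ids` and the four sample rows are made up for this example. Unlike the q dictionary lookup, names missing from a table are kept unchanged here, which is an assumption of this sketch.

```python
from collections import defaultdict

# Hypothetical tables mirroring mapFN / mapLN: canonical name -> known variants.
FIRST = {
    "Bob": ["Bob", "bob", "Bobby", "bobby"],
    "John": ["John", "Johnn", "john"],
    "Mike": ["Mike", "mike", "Mick", "Michael"],
}
LAST = {
    "Doe": ["Doe", "doe", "Do"],
    "Smith": ["Smith", "smith", "Smyth"],
}


def invert(table):
    """Turn {canonical: [variants]} into {variant: canonical}."""
    return {v: canon for canon, variants in table.items() for v in variants}


def group_ids(rows, first_map, last_map):
    """Group customer IDs by canonical (first, last) name.

    Names that are missing from a map are kept unchanged (sketch assumption).
    """
    groups = defaultdict(list)
    for fn, ln, cid in rows:
        key = (first_map.get(fn, fn), last_map.get(ln, ln))
        groups[key].append(cid)
    return dict(groups)


# Made-up sample rows: (FirstName, LastName, CustomerID).
rows = [("Mick", "Do", 0), ("Bobby", "Smith", 1), ("john", "doe", 4), ("John", "Doe", 6)]
groups = group_ids(rows, invert(FIRST), invert(LAST))

# Only keys with more than one ID are candidate duplicates.
duplicates = {k: v for k, v in groups.items() if len(v) > 1}
print(duplicates)  # {('John', 'Doe'): [4, 6]}
```

Keeping unmapped names as they are means an unknown spelling forms its own group instead of being silently merged, which makes gaps in the mapping easier to spot.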


How to sum columns of a matrix for a specified number of columns?

I have a matrix A of size 2500 x 500. I want to sum each 10 columns and get the result as a matrix B of size 2500 x 50. That is, the first column of B is the sum of the first 10 columns of A, the second column of B is the sum of second 10 columns of A, and so on.
How can I do that without a for loop? Since I have to do that hundreds of times and it is highly time consuming to do that using for loop.
First, we "block reshape" A, such that we have the desired number of columns. Therefore, we shamelessly steal the code from the great Divakar, and put in some minimal effort to generalize it. Then, we just need to sum along the second axis, and reshape to the original form.
Here's an example with five columns to be summed:
% Sample input data
A = reshape(1:100, 10, 10).'
[r, c] = size(A);
% Number of columns to be summed
n_cols = 5;
% Block reshape to n_cols, see https://stackoverflow.com/a/40508999/11089932
B = reshape(permute(reshape(A, r, n_cols, []), [1, 3, 2]), [], n_cols);
% Sum along second axis
B = sum(B, 2);
% Reshape to original form
B = reshape(B, r, c / n_cols)
That's the output:
A =
1 2 3 4 5 6 7 8 9 10
11 12 13 14 15 16 17 18 19 20
21 22 23 24 25 26 27 28 29 30
31 32 33 34 35 36 37 38 39 40
41 42 43 44 45 46 47 48 49 50
51 52 53 54 55 56 57 58 59 60
61 62 63 64 65 66 67 68 69 70
71 72 73 74 75 76 77 78 79 80
81 82 83 84 85 86 87 88 89 90
91 92 93 94 95 96 97 98 99 100
B =
15 40
65 90
115 140
165 190
215 240
265 290
315 340
365 390
415 440
465 490
Hope that helps!
This can be done with splitapply. An advantage of this approach is that it works even if the group size does not divide the number of columns (the last group is smaller):
A = reshape(1:120, 12, 10).'; % example 10×12 data (borrowed from HansHirse)
n_cols = 5; % number of columns to sum over
result = splitapply(@(x)sum(x,2), A, ceil((1:size(A,2))/n_cols));
In this example,
A =
1 2 3 4 5 6 7 8 9 10 11 12
13 14 15 16 17 18 19 20 21 22 23 24
25 26 27 28 29 30 31 32 33 34 35 36
37 38 39 40 41 42 43 44 45 46 47 48
49 50 51 52 53 54 55 56 57 58 59 60
61 62 63 64 65 66 67 68 69 70 71 72
73 74 75 76 77 78 79 80 81 82 83 84
85 86 87 88 89 90 91 92 93 94 95 96
97 98 99 100 101 102 103 104 105 106 107 108
109 110 111 112 113 114 115 116 117 118 119 120
result =
15 40 23
75 100 47
135 160 71
195 220 95
255 280 119
315 340 143
375 400 167
435 460 191
495 520 215
555 580 239
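As a cross-check of the grouping logic (not a replacement for the MATLAB code), here is a small pure-Python sketch that sums consecutive groups of columns and lets the last group be narrower. The function name `block_column_sums` is made up for this example; the data reproduces the 10×12 matrix above.

```python
def block_column_sums(matrix, n_cols):
    """Sum each consecutive group of n_cols columns, row by row.

    The last group is narrower when n_cols does not divide the row length.
    """
    return [
        [sum(row[start:start + n_cols]) for start in range(0, len(row), n_cols)]
        for row in matrix
    ]


# Same 10x12 data as above: rows hold 1..12, 13..24, and so on.
A = [list(range(12 * r + 1, 12 * r + 13)) for r in range(10)]
result = block_column_sums(A, 5)

print(result[0])   # [15, 40, 23]
print(result[-1])  # [555, 580, 239]
```

Slicing past the end of a Python list simply returns the shorter remainder, which is what makes the uneven last group work without any special casing.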

Retrieve valid values for parameter PaperSize of Set-PrintConfiguration

We're trying to verify valid input for the parameter PaperSize of Set-PrintConfiguration.
We're trying to create an array with all possible accepted values for the argument:
$testCommand = Get-Command Set-PrintConfiguration
$testCommand.Parameters.PaperSize
$testPaperSize = [Microsoft.PowerShell.Cmdletization.GeneratedTypes.PrinterConfiguration.PaperSizeEnum]
$testPaperSize.DeclaredFields.Name
This does return a list of options but it also includes a value like value__ which does not seem to be suggested by intellisense. This makes me think the query for valid values is incorrect.
To get the possible values from the PaperKind enum, you can do something like this:
function Get-Enum {
    param (
        [type]$Type
    )
    if ($Type.BaseType.FullName -ne 'System.Enum') {
        Write-Error "Type '$Type' is not an enum"
    }
    else {
        [enum]::GetNames($Type) | ForEach-Object {
            # value__ is the underlying numeric value of the enum member
            $exp = "([$Type]::$($_)).value__"
            [PSCustomObject] @{
                'Name' = $_
                'Value' = Invoke-Expression -Command $exp
            }
        }
    }
}
Get-Enum System.Drawing.Printing.PaperKind
On my machine it returns:
Name Value
---- -----
Custom 0
Letter 1
LetterSmall 2
Tabloid 3
Ledger 4
Legal 5
Statement 6
Executive 7
A3 8
A4 9
A4Small 10
A5 11
B4 12
B5 13
Folio 14
Quarto 15
Standard10x14 16
Standard11x17 17
Note 18
Number9Envelope 19
Number10Envelope 20
Number11Envelope 21
Number12Envelope 22
Number14Envelope 23
CSheet 24
DSheet 25
ESheet 26
DLEnvelope 27
C5Envelope 28
C3Envelope 29
C4Envelope 30
C6Envelope 31
C65Envelope 32
B4Envelope 33
B5Envelope 34
B6Envelope 35
ItalyEnvelope 36
MonarchEnvelope 37
PersonalEnvelope 38
USStandardFanfold 39
GermanStandardFanfold 40
GermanLegalFanfold 41
IsoB4 42
JapanesePostcard 43
Standard9x11 44
Standard10x11 45
Standard15x11 46
InviteEnvelope 47
LetterExtra 50
LegalExtra 51
TabloidExtra 52
A4Extra 53
LetterTransverse 54
A4Transverse 55
LetterExtraTransverse 56
APlus 57
BPlus 58
LetterPlus 59
A4Plus 60
A5Transverse 61
B5Transverse 62
A3Extra 63
A5Extra 64
B5Extra 65
A2 66
A3Transverse 67
A3ExtraTransverse 68
JapaneseDoublePostcard 69
A6 70
JapaneseEnvelopeKakuNumber2 71
JapaneseEnvelopeKakuNumber3 72
JapaneseEnvelopeChouNumber3 73
JapaneseEnvelopeChouNumber4 74
LetterRotated 75
A3Rotated 76
A4Rotated 77
A5Rotated 78
B4JisRotated 79
B5JisRotated 80
JapanesePostcardRotated 81
JapaneseDoublePostcardRotated 82
A6Rotated 83
JapaneseEnvelopeKakuNumber2Rotated 84
JapaneseEnvelopeKakuNumber3Rotated 85
JapaneseEnvelopeChouNumber3Rotated 86
JapaneseEnvelopeChouNumber4Rotated 87
B6Jis 88
B6JisRotated 89
Standard12x11 90
JapaneseEnvelopeYouNumber4 91
JapaneseEnvelopeYouNumber4Rotated 92
Prc16K 93
Prc32K 94
Prc32KBig 95
PrcEnvelopeNumber1 96
PrcEnvelopeNumber2 97
PrcEnvelopeNumber3 98
PrcEnvelopeNumber4 99
PrcEnvelopeNumber5 100
PrcEnvelopeNumber6 101
PrcEnvelopeNumber7 102
PrcEnvelopeNumber8 103
PrcEnvelopeNumber9 104
PrcEnvelopeNumber10 105
Prc16KRotated 106
Prc32KRotated 107
Prc32KBigRotated 108
PrcEnvelopeNumber1Rotated 109
PrcEnvelopeNumber2Rotated 110
PrcEnvelopeNumber3Rotated 111
PrcEnvelopeNumber4Rotated 112
PrcEnvelopeNumber5Rotated 113
PrcEnvelopeNumber6Rotated 114
PrcEnvelopeNumber7Rotated 115
PrcEnvelopeNumber8Rotated 116
PrcEnvelopeNumber9Rotated 117
PrcEnvelopeNumber10Rotated 118
Hope that helps

using recursion for solving Euler 18 in q

I have written this code in q for solving the Euler 18 problem, as described in the link below, using recursion.
https://stackoverflow.com/questions/8002252/euler-project-18-approach
Though the code works, it is not efficient and gets a stack overflow at pyramids of sizes greater than 3000. How could I make this code much more efficient? I believe the optimum code can be less than 30 characters.
pyr:{[x]
lsize:count x;
y:x;
$[lsize <=1;y[0];
[.ds.lastone:x[lsize - 1];
.ds.lasttwo:x[lsize - 2];
y:{{max (.ds.lasttwo)[x] +/: .ds.lastone[x],.ds.lastone[x+1]}each til count .ds.lasttwo};
$[(count .ds.lasttwo)=1;y:{max (.ds.lasttwo) +/: .ds.lastone[x],.ds.lastone[x+1]}0;y:y[]];
x[lsize - 2]:y;
pyr[-1_x]]]
}
To properly implement this logic in q you need to use adverbs.
First, to quickly find the rolling maximums you can use the prior adverb. For example:
q)input:(75;95 64;17 47 82;18 35 87 10;20 04 82 47 65;19 01 23 75 03 34;88 02 77 73 07 63 67;99 65 04 28 06 16 70 92;41 41 26 56 83 40 80 70 33;41 48 72 33 47 32 37 16 94 29;53 71 44 65 25 43 91 52 97 51 14;70 11 33 28 77 73 17 78 39 68 17 57;91 71 52 38 17 14 91 43 58 50 27 29 48;63 66 04 68 89 53 67 30 73 16 69 87 40 31;04 62 98 27 23 09 70 98 73 93 38 53 60 04 23)
q)last input
4 62 98 27 23 9 70 98 73 93 38 53 60 4 23
q)1_(|) prior last input
62 98 98 27 23 70 98 98 93 93 53 60 60 23
That last line outputs a vector with the maximum value of each successive pair in the input vector. Once you have this you can add it to the next row up and repeat.
q)foo:{y+1_(|) prior x}
q)foo[input 14;input 13]
125 164 102 95 112 123 165 128 166 109 122 147 100 54
Then, to apply this function over the whole triangle, use the over adverb:
q)foo over reverse input
,1074
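If it helps to see the same bottom-up reduction outside q, here is a rough Python sketch. `fold_row` plays the role of `foo` and `functools.reduce` stands in for `over`; the function names and the small four-row triangle are chosen only for this illustration. Because it is a plain fold rather than recursion, stack depth is not a concern.

```python
from functools import reduce


def fold_row(below, above):
    """Add to each entry of `above` the larger of its two children in `below`."""
    best_child = [max(a, b) for a, b in zip(below, below[1:])]
    return [v + c for v, c in zip(above, best_child)]


def max_path_sum(triangle):
    """Collapse the triangle from the bottom row up to the apex."""
    return reduce(fold_row, reversed(triangle))[0]


# Small illustrative triangle (not the full 15-row input above).
small = [[3], [7, 4], [2, 4, 6], [8, 5, 9, 3]]
print(max_path_sum(small))  # 23

# First entries of the last two rows of the input above.
print(fold_row([4, 62, 98, 27], [63, 66, 4]))  # [125, 164, 102]
```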
EDIT: The approach above can be generalized further.
q provides a moving max function mmax. With this you can find "the x-item moving maximum of numeric y", which generalizes the use of prior above. For example, you can use this to find the moving maximum of pairs or triplets in the last row of the input:
q)last input
4 62 98 27 23 9 70 98 73 93 38 53 60 4 23
q)2 mmax last input
4 62 98 98 27 23 70 98 98 93 93 53 60 60 23
q)3 mmax last input
4 62 98 98 98 27 70 98 98 98 93 93 60 60 60
mmax can be used to simplify foo above:
q)foo:{y+1_ 2 mmax x}
What's especially nice about this is that it can be used to generalize to variants of this problem with wider triangles. For example, the triangle below has two more values on each row and from any point on a row you can move to the left, middle, or right of the row below it.
5
5 6 7
6 7 3 9 1
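A hedged Python sketch of that wider variant, assuming each row is `width - 1` entries longer than the row above it and that position `i` can move to positions `i` through `i + width - 1` in the row below (left, middle, right for `width = 3`). The function names are made up for this example, and it takes the maximum over full windows only rather than reproducing `mmax` exactly.

```python
def fold_row_window(below, above, width):
    """Add to each entry of `above` the best of its `width` children in `below`.

    Assumes len(below) == len(above) + width - 1.
    """
    best = [max(below[i:i + width]) for i in range(len(above))]
    return [v + c for v, c in zip(above, best)]


def max_path_sum_wide(triangle, width):
    """Collapse the triangle bottom-up using a moving maximum of `width` items."""
    acc = triangle[-1]
    for row in reversed(triangle[:-1]):
        acc = fold_row_window(acc, row, width)
    return acc[0]


wide = [[5], [5, 6, 7], [6, 7, 3, 9, 1]]
print(max_path_sum_wide(wide, 3))  # 21, via the path 5 -> 7 -> 9
```

With `width = 2` the same function handles the ordinary Euler 18 triangle.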

Sorting wrt to a column value in matlab [duplicate]

I have multiple columns in my dataset, and column 2 contains values from 1 to 7. I want to sort my dataset with respect to the second column. Thanks in advance.
The command you need is sortrows.
By default this sorts with respect to the first column, but an additional argument can be used to change this to the 2nd (or 5th, 17th, etc.).
If A is your original array:
B = sortrows(A,2);
will give you the array B, sorted with respect to the 2nd column.
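To make "sort whole rows by one column" concrete, here is a tiny Python sketch of the same operation. It is only an analogy to sortrows, with made-up sample data and a made-up helper name `sort_rows`; note that Python indexes from 0, so the second column is index 1.

```python
def sort_rows(rows, column):
    """Return the rows reordered by the value in the given (0-based) column.

    sorted() is stable, so rows with equal keys keep their original order.
    """
    return sorted(rows, key=lambda row: row[column])


# Made-up data: the second column holds values between 1 and 7.
A = [[95, 3, 92], [23, 1, 73], [60, 7, 17], [48, 1, 40]]
B = sort_rows(A, 1)
print(B)  # [[23, 1, 73], [48, 1, 40], [95, 3, 92], [60, 7, 17]]
```

Each row moves as a unit, which is the difference from sorting every column independently.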
What did you mean by sort with respect to second column? You should be more specific or at least give us an example.
If you need a simple sort on each column use the following
A =
95 45 92 41 13 1 84
23 1 73 89 20 74 52
60 82 17 5 19 44 20
48 44 40 35 60 93 67
89 61 93 81 27 46 83
76 79 91 0 19 41 1
Sort each column of A in ascending order:
c = sort(A, 1)
c =
23 1 17 0 13 1 1
48 44 40 5 19 41 20
60 45 73 35 19 44 52
76 61 91 41 20 46 67
89 79 92 81 27 74 83
95 82 93 89 60 93 84

Equivalent of *nix fold in PowerShell

Today I had a few hundred items (IDs from a SQL query) and needed to paste them into another query to be readable by an analyst. I needed the *nix fold command. I wanted to take the 300 lines and reformat them as multiple numbers per line separated by a space. I would have used fold -w 100 -s.
Similar tools on *nix include fmt and par.
On Windows is there an easy way to do this in PowerShell? I expected one of the *-Format commandlets to do it, but I couldn't find it. I'm using PowerShell v4.
See https://unix.stackexchange.com/questions/25173/how-can-i-wrap-text-at-a-certain-column-size
# Input Data
# simulate a set of 330 numeric IDs (100001 to 100330)
100001..100330 |
Out-File _sql.txt -Encoding ascii
# I want output like:
# 100001, 100002, 100003, 100004, 100005, ... 100010, 100011
# 100012, 100013, 100014, 100015, 100016, ... 100021, 100022
# each line less than 100 characters.
Depending on how big the file is, you could read it all into memory, join it with spaces, and then split after at least 100 characters at the next space:
(Get-Content C:\Temp\test.txt) -join " " -split '(.{100,}?[ |$])' | Where-Object{$_}
That regex looks for 100 characters and then the first space after that. The match is then used by -split, but since the pattern is wrapped in parentheses the match is returned instead of discarded. The Where-Object removes the empty entries that are created in between the matches.
Small sample to prove the theory:
@"
134
124
1
225
234
4
34
2
42
342
5
5
2
6
"@.split("`n") -join " " -split '(.{10,}?[ |$])' | Where-Object{$_}
The above splits on 10 characters where possible. If it cannot the numbers are still preserved. Sample is based on me banging on the keyboard with my head.
134 124 1
225 234 4
34 2 42
342 5 5
2 6
You could then make this into a function to get the simplicity back that you are most likely looking for. It can get better but this isn't really the focus of the answer.
Function Get-Folded{
    Param(
        [string[]]$Strings,
        [int]$Wrap = 50
    )
    $strings -join " " -split "(.{$wrap,}?[ |$])" | Where-Object{$_}
}
Again with the samples
PS C:\Users\mcameron> Get-Folded -Strings (Get-Content C:\temp\test.txt) -wrap 40
"Lorem ipsum dolor sit amet, consectetur
adipiscing elit, sed do eiusmod tempor incididunt
ut labore et dolore magna aliqua. Ut enim
ad minim veniam, quis nostrud exercitation
... output truncated...
You can see that it was supposed to split on 40 characters but the second line is longer. It split on the next space after 40 to preserve the word.
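The "break at the next space after N characters" behaviour can be sketched in Python as a plain loop. This is not a port of the regex, so line breaks may differ from the sample outputs above; it only illustrates why lines can end up longer than the requested width. The name `fold_overshoot` is made up for this example.

```python
def fold_overshoot(words, min_width):
    """Close a line at the first word boundary at or after min_width characters.

    Lines can therefore be longer than min_width, but words are never split.
    """
    lines, current = [], ""
    for word in words:
        current = word if not current else current + " " + word
        if len(current) >= min_width:
            lines.append(current)
            current = ""
    if current:  # keep the last, possibly short, line
        lines.append(current)
    return lines


words = "134 124 1 225 234 4 34 2 42 342 5 5 2 6".split()
print(fold_overshoot(words, 10))
# ['134 124 1 225', '234 4 34 2', '42 342 5 5', '2 6']
```

The first line is 13 characters long even though the width is 10, because "134 124 1" is still too short and the next word is appended whole.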
If it's one item per line, and you want to join every 100 items onto a single line separated by a space you could put all the output into a text file then do this:
gc c:\count.txt -readcount 100 | % {$_ -join " "}
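The same fixed-count chunking, sketched in Python for comparison. `join_in_chunks` is a made-up helper name, and the 330 IDs imitate the sample data from the question. Like -ReadCount, this controls the number of items per line, not the line length.

```python
def join_in_chunks(items, count, sep=" "):
    """Join every `count` consecutive items into one line."""
    return [
        sep.join(str(x) for x in items[i:i + count])
        for i in range(0, len(items), count)
    ]


ids = list(range(100001, 100331))  # 330 sample IDs, as in the question
lines = join_in_chunks(ids, 10)
print(lines[0])    # 100001 100002 100003 100004 100005 100006 100007 100008 100009 100010
print(len(lines))  # 33
```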
When I saw this, the first thing that came to my mind was abusing Format-Table to do this, mostly because it knows how to break the lines properly when you specify a width. After coming up with a function, it seems that the other solutions presented are shorter and probably easier to understand, but I figured I'd still go ahead and post this solution anyway:
function fold {
    [CmdletBinding()]
    param(
        [Parameter(ValueFromPipeline)]
        $InputObject,
        [Alias('w')]
        [int] $LineWidth = 100,
        [int] $ElementWidth
    )
    begin {
        $SB = New-Object System.Text.StringBuilder
        if ($ElementWidth) {
            $SBFormatter = "{0,$ElementWidth} "
        }
        else {
            $SBFormatter = "{0} "
        }
    }
    process {
        foreach ($CurrentObject in $InputObject) {
            [void] $SB.AppendFormat($SBFormatter, $CurrentObject)
        }
    }
    end {
        # Format-Table wanted some sort of an object assigned to it, so I
        # picked the first static object that popped in my head:
        ([guid]::Empty | Format-Table -Property @{N="DoesntMatter"; E={$SB.ToString()}; Width = $LineWidth } -Wrap -HideTableHeaders |
            Out-String).Trim("`r`n")
    }
}
Using it gives output like this:
PS C:\> 0..99 | Get-Random -Count 100 | fold
1 73 81 47 54 41 17 87 2 55 30 91 19 50 64 70 51 29 49 46 39 20 85 69 74 43 68 82 76 22 12 35 59 92
13 3 88 6 72 67 96 31 11 26 80 58 16 60 89 62 27 36 37 18 97 90 40 65 42 15 33 24 23 99 0 32 83 14
21 8 94 48 10 4 84 78 52 28 63 7 34 86 75 71 53 5 45 66 44 57 77 56 38 79 25 93 9 61 98 95
PS C:\> 0..99 | Get-Random -Count 100 | fold -ElementWidth 2
74 89 10 42 46 99 21 80 81 82 4 60 33 45 25 57 49 9 86 84 83 44 3 77 34 40 75 50 2 18 6 66 13
64 78 51 27 71 97 48 58 0 65 36 47 19 31 79 55 56 59 15 53 69 85 26 20 73 52 68 35 93 17 5 54 95
23 92 90 96 24 22 37 91 87 7 38 39 11 41 14 62 12 32 94 29 67 98 76 70 28 30 16 1 61 88 43 8 63
72
PS C:\> 0..99 | Get-Random -Count 100 | fold -ElementWidth 2 -w 40
21 78 64 18 42 15 40 99 29 61 4 95 66
86 0 69 55 30 67 73 5 44 74 20 68 16
82 58 3 46 24 54 75 14 11 71 17 22 94
45 53 28 63 8 90 80 51 52 84 93 6 76
79 70 31 96 60 27 26 7 19 97 1 59 2
65 43 81 9 48 56 25 62 13 85 47 98 33
34 12 50 49 38 57 39 37 35 77 89 88 83
72 92 10 32 23 91 87 36 41
This is what I ended up using.
# simulate a set of 330 SQL IDs (100001 to 100330)
100001..100330 |
%{ "$_, " } | # I'll need this decoration in the SQL script
Out-File _sql.txt -Encoding ascii
gc .\_sql.txt -ReadCount 10 | %{ $_ -join ' ' }
Thanks everyone for the effort and the answers. I'm really surprised there wasn't a way to do this with Format-Table without the use of [guid]::Empty in Rohn Edward's answer.
My IDs are much more consistent than the example I gave, so Noah's use of gc -ReadCount is by far the simplest solution in this particular data set, but in the future I'd probably use Matt's answer or the answers linked to by Emperor in comments.
I came up with this:
$array =
(@'
1
2
3
10
11
100
101
'@).split("`n") |
foreach {$_.trim()}
$array = $array * 40
# Capacity and MaxCapacity are both 100, so Append throws once a line would exceed 100 characters
$SB = New-Object Text.StringBuilder(100,100)
foreach ($item in $array) {
    Try { [void]$SB.Append("$item ") }
    Catch {
        $SB.ToString()
        [void]$SB.Clear()
        [Void]$SB.Append("$item ")
    }
}
#don't forget the last line
$SB.ToString()
1 2 3 10 11 100 101 1 2 3 10 11 100 101 1 2 3 10 11 100 101 1 2 3 10 11 100 101 1 2 3 10 11 100 101
1 2 3 10 11 100 101 1 2 3 10 11 100 101 1 2 3 10 11 100 101 1 2 3 10 11 100 101 1 2 3 10 11 100 101
1 2 3 10 11 100 101 1 2 3 10 11 100 101 1 2 3 10 11 100 101 1 2 3 10 11 100 101 1 2 3 10 11 100 101
1 2 3 10 11 100 101 1 2 3 10 11 100 101 1 2 3 10 11 100 101 1 2 3 10 11 100 101 1 2 3 10 11 100 101
1 2 3 10 11 100 101 1 2 3 10 11 100 101 1 2 3 10 11 100 101 1 2 3 10 11 100 101 1 2 3 10 11 100 101
1 2 3 10 11 100 101 1 2 3 10 11 100 101 1 2 3 10 11 100 101 1 2 3 10 11 100 101 1 2 3 10 11 100 101
1 2 3 10 11 100 101 1 2 3 10 11 100 101 1 2 3 10 11 100 101 1 2 3 10 11 100 101 1 2 3 10 11 100 101
Maybe not as compact as you were hoping for, and there may be better ways to do it, but it seems to work.
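The flush-when-full idea can also be written without relying on an exception. Below is a small Python sketch of the same greedy fill, using the same repeated sample items; `fold_max` is a made-up name, and unlike the StringBuilder version it does not keep a trailing space on each line.

```python
def fold_max(items, max_width):
    """Greedily pack items into lines of at most max_width characters.

    A single item longer than max_width still gets a line of its own.
    """
    lines, current = [], ""
    for item in map(str, items):
        candidate = item if not current else current + " " + item
        if current and len(candidate) > max_width:
            lines.append(current)  # line is full: flush it and start a new one
            current = item
        else:
            current = candidate
    if current:  # don't forget the last line
        lines.append(current)
    return lines


items = ["1", "2", "3", "10", "11", "100", "101"] * 40
lines = fold_max(items, 100)
print(len(lines))          # 8
print(max(map(len, lines)))  # 99
```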