How to read a large number of LDAP entries in Perl?

I already have an LDAP script that reads LDAP user information one entry at a time. My problem is that I am returning all users found in Active Directory. This will not work because our AD currently has around 100,000 users, which causes the script to crash due to memory limitations.
What I was thinking of doing was to process users in batches of X users and, if possible, use threads to process some users in parallel. The only thing is that I have just started using Perl, so I was wondering if anyone could give me a general idea of how to do this.

If you can get the executable ldapsearch to work in your environment (and it does work in *nix and Windows, although the syntax is often different), you can try something like this:
my $LDAP_SEARCH = "ldapsearch -h $LDAP_SERVER -p $LDAP_PORT -b $BASE -D uid=$LDAP_USERNAME -w $LDAP_PASSWORD -LLL";
my @LDAP_FIELDS = qw(uid mail Manager telephoneNumber CostCenter NTLogin displayName);
open (LDAP, "-|:utf8", "$LDAP_SEARCH \"$FILTER\" " . join(" ", @LDAP_FIELDS));
while (<LDAP>) {
# process each LDAP response
}
I use that to read nearly 100K LDAP entries without memory problems (although it still takes 30 minutes or more). You'll need to define $FILTER (or leave it blank) and of course all the LDAP server/username/password pieces.
If you want/need to do a more pure-Perl version, I've had better luck with Net::LDAP instead of Net::LDAP::Express, especially for large queries.
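If you do go the Net::LDAP route, its paged-results control lets you pull the result set from AD in batches instead of all at once (AD generally won't hand back 100K entries in a single unpaged search anyway). Here is a minimal sketch, assuming Net::LDAP and Net::LDAP::Control::Paged are installed; the server, credentials, base DN, attributes and page size are placeholders to replace with your own:
use Net::LDAP;
use Net::LDAP::Control::Paged;
use Net::LDAP::Constant qw(LDAP_CONTROL_PAGED);

# Placeholder connection details -- substitute your own
my $ldap = Net::LDAP->new('ldap.example.com') or die "$@";
$ldap->bind('cn=reader,dc=example,dc=com', password => 'secret');

my $page = Net::LDAP::Control::Paged->new(size => 500);   # batch size
my @search_args = (
    base    => 'dc=example,dc=com',
    filter  => '(objectClass=user)',
    attrs   => [qw(sAMAccountName mail displayName)],
    control => [$page],
);

while (1) {
    my $mesg = $ldap->search(@search_args);
    die $mesg->error if $mesg->code;

    # Only one page of entries is held in memory at a time
    while (my $entry = $mesg->pop_entry) {
        # process each user here
    }

    my ($resp) = $mesg->control(LDAP_CONTROL_PAGED) or last;
    my $cookie = $resp->cookie;
    last unless defined $cookie && length $cookie;   # empty cookie means we're done
    $page->cookie($cookie);                          # ask the server for the next page
}
$ldap->unbind;
That keeps memory flat no matter how many users AD returns; threading is rarely worth the added complexity here, since the work is almost entirely network- and server-bound.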


How can I use tar and tee in PowerShell to do a read once, write many, raw file copy

I'm using a small laptop to copy video files on location to multiple memory sticks (~8GB).
The copy has to be done without supervision once it's started and has to be fast.
I've identified a serious boundary to the speed: when making several copies (e.g. 4 sticks from 2 cameras, i.e. 8 transfers * 8 GB), the multiple reads use a lot of bandwidth, especially since the cameras have a USB2.0 interface (two ports) and limited capacity.
If I had unix I could use tar -cf - | tee tar -xf /stick1 | tee tar -xf /stick2 etc
which means I'd only have to pull 1 copy (2*8Gb) from each camera once, on the USB2.0 interface.
The memory sticks are generally on a hub on the single USB3.0 interface, which is driven on a different channel, so they write sufficiently fast.
For reasons, I'm stuck using the current Win10 PowerShell.
I'm currently writing the whole command to a string (concatenating the various sources and the various targets) and then using Invoke-Process to execute the copy process while I'm entertaining and buying the rounds in the pub after the shoot. (hence the necessity to be afk).
I can tar cf - | tar xf a single file, but can't seem to get the tee functioning correctly.
I can also successfully use the microSD slot to do a single camera's card, which is not as physically nice but is fast for one camera's recording; I still have the bandwidth issue on the remaining camera(s). We may end up with 4-5 source cameras at the same time, which means the read-once, write-many problem is still going to be an issue.
Edit: I've just advanced to play with Get-Content -raw | tee \stick1\f1 | tee \stick2\f1 | out-null . Haven't done timings or file verification yet....
Edit2: It seems like the Get-Content -raw works properly, but the functionality of PowerShell pipelines violates two of the fundamental Commandments of programming: A program shall do one thing and do it well, Thou shalt not mess with the data stream.
For some unknown reason PowerShell default (and only) pipeline behaviour always modifies the datastream it is supposed to transfer from one stream to the next. Doesn't seem to have a -raw option nor does it seem to have a $session or $global I can set to remedy the mutilation.
How do PowerShell people transfer raw binary from one stream out, into the next process?
Maybe not quite what you want (if you insist on using built-in PowerShell commands), but if you care about speed, use streams and asynchronous Read/Write. PowerShell is a great tool because it can use any .NET classes seamlessly.
The script below can easily be extended to write to more than 2 destinations and can potentially handle arbitrary streams. You might want to add some error handling via try/catch there too. You may also try to play with buffered streams and various buffer sizes to optimize the code.
Some references:
FileStream.ReadAsync
FileStream.WriteAsync
CancellationToken
Task.GetAwaiter
-- 2021-12-09 update: Code is modified a little to reflect suggestions from comments.
# $InputPath, $Output1Path, $Output2Path are parameters
[Threading.CancellationTokenSource] $cancellationTokenSource = [Threading.CancellationTokenSource]::new()
[Threading.CancellationToken] $cancellationToken = $cancellationTokenSource.Token
[int] $bufferSize = 64*1024
$fileStreamIn = [IO.FileStream]::new($inputPath,[IO.FileMode]::Open,[IO.FileAccess]::Read,[IO.FileShare]::None,$bufferSize,[IO.FileOptions]::SequentialScan)
$fileStreamOut1 = [IO.FileStream]::new($output1Path,[IO.FileMode]::CreateNew,[IO.FileAccess]::Write,[IO.FileShare]::None,$bufferSize)
$fileStreamOut2 = [IO.FileStream]::new($output2Path,[IO.FileMode]::CreateNew,[IO.FileAccess]::Write,[IO.FileShare]::None,$bufferSize)
try{
[Byte[]] $bufferToWriteFrom = [byte[]]::new($bufferSize)
[Byte[]] $bufferToReadTo = [byte[]]::new($bufferSize)
$Time = [System.Diagnostics.Stopwatch]::StartNew()
$bytesRead = $fileStreamIn.read($bufferToReadTo,0,$bufferSize)
while ($bytesRead -gt 0){
$bufferToWriteFrom,$bufferToReadTo = $bufferToReadTo,$bufferToWriteFrom
$writeTask1 = $fileStreamOut1.WriteAsync($bufferToWriteFrom,0,$bytesRead,$cancellationToken)
$writeTask2 = $fileStreamOut2.WriteAsync($bufferToWriteFrom,0,$bytesRead,$cancellationToken)
$readTask = $fileStreamIn.ReadAsync($bufferToReadTo,0,$bufferSize,$cancellationToken)
$writeTask1.Wait()
$writeTask2.Wait()
$bytesRead = $readTask.GetAwaiter().GetResult()
}
$time.Elapsed.TotalSeconds
}
catch {
throw $_
}
finally{
$fileStreamIn.Close()
$fileStreamOut1.Close()
$fileStreamOut2.Close()
}

Perl interface with Aspell

I am trying to identify misspelled words with Aspell via Perl. I am working on a Linux server without administrator privileges which means I have access to Perl and Aspell but not, for example, Text::Aspell which is a Perl interface for Aspell.
I want to do the very simple task of passing a list of words to Aspell and having it return the words that are misspelled. If the words I want to check are "dad word lkjlkjlkj" I can do this through the command line with the following commands:
aspell list
dad word lkjlkjlkj
Aspell requires CTRL + D at the end to submit the word list. It would then return "lkjlkjlkj", as this isn't in the dictionary.
In order to do the exact same thing, but submitted via Perl (because I need to do this for thousands of documents) I have tried:
my $list = q(dad word lkjlkjlkj);
my @arguments = ("aspell list", $list, "^D");
my $aspell_out=`@arguments`;
print "Aspell output = $aspell_out\n";
The expected output is "Aspell output = lkjlkjlkj" because this is the output that Aspell gives when you submit these commands via the command line. However, the actual output is just "Aspell output = ". That is, Perl does not capture any output from Aspell. No errors are thrown.
I am not an expert programmer, but I thought this would be a fairly simple task. I've tried various iterations of this code and nothing works. I did some digging and I'm concerned that perhaps because Aspell is interactive, I need to use something like Expect, but I cannot figure out how to use it. Nor am I sure that it is actually the solution to my problem. I also think ^D should be an appropriate replacement for CTRL+D at the end of the commands, but all I know is it doesn't throw an error. I also tried \cd instead. Whatever it is, there is obviously an issue in either submitting the command or capturing the output.
The complication with using aspell from a program is that it is an interactive, command-line driven tool, as you suspect. However, there is a simple way to do what you need.
In order to use aspell's command list one needs to pass it words via STDIN, as its man page says. While I find the GNU Aspell manual a little difficult to get going with, passing input to a program via its STDIN is easy enough and we can rewrite the invocation as
echo dad word lkj | aspell list
We get lkj printed back, as expected. Now this can be run from a program just as it stands
my $word_list = q(word lkj good asdf);
my $cmd = qq(echo $word_list | aspell list);
my @aspell_out = qx($cmd);
print for @aspell_out;
This prints lines lkj and asdf.
I assemble the command in a string (as opposed to an array) for specific reasons, explained below. The qx is the operator form of backticks, which I prefer for its far superior readability.
Note that qx can return all output in a string, if in scalar context (assigned to a scalar for example), or in a list when in list context. Here I assign to an array so you get each word as an element (alas, each element also comes with a trailing newline, so you may want to do chomp @aspell_out;).
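To make the context difference concrete, here is a tiny illustration reusing the same command (nothing here beyond core Perl):
my $cmd = qq(echo word lkj good asdf | aspell list);

my $as_string = qx($cmd);    # scalar context: one string with embedded newlines
my @as_lines  = qx($cmd);    # list context: one element per output line
chomp @as_lines;             # strip the trailing newlines

print "as string: $as_string";
print "as list:   @as_lines\n";    # the misspelled words, e.g. "lkj asdf"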
Comment on a list vs string form of a command
I think that it's safe to recommend to use a list-form for a command, in general. So we'd say
my #cmd = ('ls', '-l', $dir); # to be run as an external command
instead of
my $cmd = "ls -l $dir"; # to be run as an external command
The list form generally makes it easier to manage the command, and it avoids the shell altogether.
However, this case is a little different
The qx operator doesn't really behave differently -- the array gets concatenated into a string, and that runs. The very fact that we can pass it an array is incidental, and not even documented
We need to pipe input to aspell's STDIN, and the shell does that for us simply. We can use a shell with the command's LIST form as well, but then we'd need to invoke it explicitly. We can also reach aspell's STDIN by means other than the shell, but that's more complex (one shell-free way is sketched after this list)
With a command in a list the command name must be the first word, so that "aspell list" from the question is wrong and it should fail (there is no command named that) ... except that in this case it wouldn't (if the rest were correct), since for qx the array gets collapsed into a string
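For completeness, if you ever do want to reach aspell's STDIN without going through the shell, the core IPC::Open2 module is one way; this is only a sketch of that alternative, not something the echo-based solution above needs:
use strict;
use warnings;
use IPC::Open2;

my $word_list = "word lkj good asdf";

# open2 gives us a handle for the command's STDOUT and one for its STDIN
my $pid = open2(my $aspell_out, my $aspell_in, 'aspell', 'list');
print $aspell_in "$word_list\n";
close $aspell_in;                    # EOF on STDIN, like Ctrl-D at the terminal

chomp(my @misspelled = <$aspell_out>);
waitpid $pid, 0;

print "$_\n" for @misspelled;        # lkj and asdf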
Finally, aspell nicely exposes its API in a C library and that's been utilized for the module you mention. I'd suggest installing it as a user (no privileges needed) and using that.
You should take a step back and investigate if you can install Text::Aspell without administrator privileges. In most cases that's perfectly possible.
You can install modules into your home directory. If there is no C-compiler available on the server you can install the module on a compatible machine, compile and copy the files.

Benchmarking in BaseX: how to set up

Currently I am an intern at a research group that makes large sets of texts (corpora) searchable. Not only can one search for literal strings, but more importantly it is also possible to look for similar syntactical dependency structures as the given input, without the need of being proficient in any programming language or corpus annotation style. It may be clear that this tool is intended for linguists.
At the start of the project - before I was engaged in the project - the tool was limited to rather small corpora (up to 9 million words). The goal is to make large sets of texts searchable as well. We are talking about +- 500 million words. Attempts have been made that in theory ought to improve speed by reducing the search space (see this paper), but this has not been tested yet. The result of this attempt is a new file structure. Let's call this structure B, compared to a non-processed structure A. We expect B to provide faster results when queried with BaseX.
My question is: what is the best way to test and compare both approaches with a Perl script? Below you find my current script to query BaseX locally. It takes two arguments: a directory that stores files which each contain the XPaths I have selected to benchmark with, and a limit on the number of results to return. When the limit is set to zero, no limit is applied.
Because some parts of the dataset are so incredibly huge, we have divided them in different, equally sized files as well, called treebankparts. They are stored in <tb> tags inside treebankparts.lst.
#!/usr/bin/perl
use warnings;
$| = 1; # flush every print
# Directory where XPaths are stored
my $directory = shift(@ARGV);
# Set limit. If set to zero all results will be returned
my $limit = shift(@ARGV);
# Create session, connect to BaseX
my $session = Session->new([INFORMATION WITHHELD]);
# List all files in directory
@xpathfiles = <$directory/*.txt>;
# Read lines of treebank parts into variable
open( my $tfh, "treebankparts.lst" ) or die "cannot open file treebankparts.lst";
chomp( my @tlines = <$tfh> );
close $tfh;
# Loop through all XPaths in $directory
foreach my $xpathfile (@xpathfiles) {
open( my $xfh, $xpathfile ) or die "cannot open file $xpathfile";
chomp( my @xlines = <$xfh> );
close $xfh;
print STDOUT "File = $xpathfile\n";
# Loop through lines from XPath file (= XPath query)
foreach my $xline (@xlines) {
# Loop through the lines of treebank file
foreach my $tline (@tlines) {
my ($treebank) = $tline =~ /<tb>(.+)<\/tb>/;
QuerySonar( $xline, $treebank );
}
}
}
$session->close();
sub QuerySonar {
my ( $xpath, $db ) = @_;
print STDOUT "Querying $db for $xpath\n";
print STDOUT "Limit = $limit\n";
my $x_limit;
my $x_resultsofxp = 'declare variable $results := db:open("' . $db . '")/treebank/alpino_ds'
. $xpath . ';';
my $x_open = '<results>';
my $x_totalcount = '<total>{count($results)}</total>';
my $x_loopinit = '{for $node at $limitresults in $results';
# Spaces are important!
if ( $limit > 0 ) {
$x_limit = ' where $limitresults <= ' . $limit . ' ';
}
# Comment needed to prevent `Incomplete FLWOR expression`
else { $x_limit = '(: No limit set :)'; }
my $x_sentenceinfo = 'let $sentid := ($node/ancestor::alpino_ds/@id)
let $sentence := ($node/ancestor::alpino_ds/sentence)
let $begin := ($node//@begin)
let $idlist := ($node//@id)
let $beginlist := (distinct-values($begin))';
# Separate sentence info by tab
my $x_loopexit = 'return <match>{data($sentid)}
{string-join($idlist, "-")}
{string-join($beginlist, "-")}
{data($sentence)}</match>}';
my $x_close = '</results>';
# Concatenate all XQuery parts
my $x_concatquery =
$x_resultsofxp
. $x_open
. $x_totalcount
. $x_loopinit
. $x_limit
. $x_sentenceinfo
. $x_loopexit
. $x_close;
my $querysent = $session->query($x_concatquery);
my $basexoutput = $querysent->execute();
print $basexoutput. "\n\n";
$querysent->close();
}
(Note that this is a stripped down version and that it may not work as-is. This snippet does not use structure B!)
What happens is: loop through all XPath files, loop through each line in an XPath file, loop through all treebankparts and then execute the sub. The sub then queries BaseX. This comes down to sending a new XQuery to BaseX, and returning the total hits and the results (possibly limited by an argument in the Perl script). So I got that going, but now the question is: how can I improve this script so I can get some benchmarking results out of it.
First of all, I'd start with adding a profiler to this script. I guess that bit is obvious. However, I am not sure how I should start comparing structure A with B. Would I put both queries (to different databases) in separate scripts, then call a profiler on both, and run both scripts a number of times and get a mean value and compare? Or would I run each query by both databases in the same script, almost at the same time?
It is important to consider caching that is happening. Therefore I am not entirely sure what build-up for benchmarking of a database this huge is appropriate. First one script, then the other. Both at the same time. Alternating queries between the two. And so on. There are so many possibilities, but I wonder which would provide the best results. Also, I would repeat the process a couple of times. Would I repeat each query and then continue to the next, or finish all XPath files, and then repeat the whole process again?
(Reading the description of the benchmark-tag I am confident that this - albeit elaborate - post is suited for SO.)
One possible improvement: minimize the number of times you transfer control from Perl to the database -- just as you have minimized the number of database connections. (Or at least set yourself up to measure the cost of the transfer of control.) I suspect you will get significantly better results if you move your loop into XQuery rather than running the loop in Perl.
A single call to a database management system asking it to perform 1000 searches is likely to be somewhat faster than 1000 calls to the DBMS each requesting a single search. The first involves two context switches: one from your script or bash to the dbms, and one back; the second involves 2000. The last time I measured something like this carefully, each context switch cost something like 500 ms; it mounted up fast. (That said, this was a long time ago, with a different database. But it was surprising [and sobering] to learn that the difference between the two query formulations I was trying to compare was dwarfed by the difference between running the test loop in a script or inside the dbms.)
A second suggestion: From what you say, the size of the database and the result sets seem likely to ensure that caching between runs doesn't have a big effect on the results. But this seems to be a testable assertion, and one worth testing. So set up your A and B scripts, and then do a trial run: does for runcount in 1 2 3 4 5; do perl A.pl; perl B.pl; done produce results comparable to for runcount in 1 2 3 4 5; do perl A.pl; done; for runcount in 1 2 3 4 5; do perl B.pl; done? If they are comparable, then you have reason to believe it doesn't matter if you run A and B separately or in alternation. If they are not comparable, then you know it does matter, which would be very valuable information. Other things being equal, I would expect caching to produce lower times when running one query several times before moving on to the next query, and cache misses to produce higher times if running each query just once. Probably worth measuring.
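If you want those per-query timings recorded by the Perl script itself, the core Time::HiRes module is sufficient. A minimal sketch, assuming the $session object and the assembled XQuery string from your script:
use Time::HiRes qw(gettimeofday tv_interval);

my $t0 = [gettimeofday];
my $querysent   = $session->query($x_concatquery);   # prepare, as in your QuerySonar sub
my $t1 = [gettimeofday];
my $basexoutput = $querysent->execute();              # the actual BaseX round trip
my $t2 = [gettimeofday];
$querysent->close();

printf "prepare: %.4f s   execute: %.4f s\n",
    tv_interval($t0, $t1), tv_interval($t1, $t2);
Logging those two numbers per query (to a CSV, say) gives you the raw data for comparing structures A and B, and for seeing how much of the total time is spent outside BaseX.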
In the same spirit, I would recommend that you run tests both with the loop in the Perl script and with the loop in an XQuery query.
A third suggestion: in practice, a query at the corpus query interface will involve several stages, each with measurable time: transmission of the query from the user's browser (assuming it's a Web interface) to the server, translation of the request into a form suitable for transmission to the back end dbms (here BaseX), context switch to BaseX, processing within BaseX, context switch back, handling by web server, transmission to user. It would be useful to have at least rough estimates of the times involved in each of these steps, or at least of the time taken for everything-but-BaseX.
So if it were me running the tests, I think I'd also prepare a set of vacuous XQuery tests, along the lines of
2 + 3
or just
42
to push the BaseX time as close to zero as possible; the measured time between user initiation of the query and display of response is the per-query overhead. (Interesting question: should one use many different trivial expressions to prevent caching of results, or should one use the same expression over and over, to encourage caching of the result? How can we try to ensure that BaseX will cache the result, but the Web server won't? ...)
A final suggestion: remember that other people who need to do benchmarking will often have the same questions as you do. This means that you can reformulate every question of the form "Should I do X or Y?" into the form "What measurable effect does the difference between X and Y have on the results of a benchmarking test?" Run some tests to try to measure that effect, and write them up. (I always find it makes it more interesting if I force myself to make a prediction after formulating the question but before measuring the difference.)
There are several things we have to separate here. The first issue is that BaseX performance should not be confused with that of your Perl script, as your Perl script seems to simply construct an XQuery (and not XPath, as you suggested in your question and tags). So for testing I would suggest using some already predefined XQueries suitable to your real-world scenarios, as your XQuery construction should be negligible. How you pass your query to BaseX, whether via the Perl API or via any other means, should not be relevant. Even if your Perl performance is relevant, you should test that performance separately.
Hence, your original question of whether you should put both scenarios in the same script or not is not relevant anymore. Instead you simply execute the two separate XQueries for scenarios A and B by themselves, without the Perl script.
You are partly correct to worry about caching; however, it is most likely the Java JIT compiler that will be relevant here (as BaseX is written in Java and the JIT does its own caching and warm-up), not BaseX itself. You should therefore use the client/server infrastructure, keep a long-running server, and warm it up before running performance measurements.
Regarding performance: The BaseX GUI and also the command line already have an included measurement (using command line you can set -V to get run times for parsing, compiling, evaluating and printing). Also, using the -r parameter you can execute a query multiple times and it will give you the average execution times.
In general, if you want to improve the performance of your script you should take a look at the query plan and the optimized query and check whether the appropriate indexes are used. Also, our new Selective Indexing might be very useful to you. If the index isn't used, your query will definitely not perform well for 500 million words.
Full disclosure: I am with the BaseX team, and you might get better help at the BaseX mailing list, or you might want to reference this question there, as our head architect isn't watching SO as regularly as the ML.

How to get a list of POSIX group members in Perl

Is there any way to get a list of all the members of a POSIX group in Perl?
I can't use getgrent() and similar because it returns the list as a space delimited string, and some usernames can have spaces in them.
I have to handle spaces in user and group names, because I'm working in an AD environment that other organizations can create users and groups in, so I'm trying to account for possible edge cases.
I'd say just use getgrent() and don't worry about spaces.
It may be possible to create a user name with one or more spaces in it, perhaps by manually editing /etc/passwd, but it's going to cause other problems as well. For example, ~foo is foo's home directory, but ~foo bar isn't foo bar's home directory.
On Linux, the useradd and adduser commands don't even permit spaces in user names. On Linux Mint 14 (based on Ubuntu 12.10):
$ sudo adduser 'foo bar'
adduser: To avoid problems, the username should consist only of
letters, digits, underscores, periods, at signs and dashes, and not start with
a dash (as defined by IEEE Std 1003.1-2001). For compatibility with Samba
machine accounts $ is also supported at the end of the username
$ sudo useradd !$
sudo useradd 'foo bar'
useradd: invalid user name 'foo bar'
$
Do you actually have user names with spaces on your system?
UPDATE: I've found that it actually is possible to create user names with spaces. useradd and adduser don't allow it (and you should be using one of those commands, or something similar, to create new accounts). But if I manually edit /etc/passwd using sudo vipw, I can create a user named foo bar, and I can do:
su - 'foo bar'
ssh 'foo bar@localhost'
etc. But it's a Really Bad Idea. Perl's getgr*() cannot tell whether a group contains one entry for foo bar or two entries for foo and bar (which is what you're asking about), and I can't use the shell's ~name syntax to refer to the account's home directory. I could use other methods to get both pieces of information, but it's much easier to avoid creating such an account in the first place.
If you're seriously concerned about some admin being foolish enough to create such an account, then you can use some of the alternative methods that have been discussed. But as I said, I don't think it's worth the effort.
(Perl could have avoided this problem by delimiting the list with : characters rather than spaces, since those are actually incompatible with the format of /etc/passwd and /etc/group, which the system depends on. But it's too late to change it now.)
UPDATE 2:
As you say in a comment (which I've edited into your question):
I have to handle spaces in user and group names, because I'm working in an AD environment that other organizations can create users and groups in, so I'm trying to account for possible edge cases.
Your solution from the same comment:
map((getgrgid($_))[0], split(/ /, `id -G $username`))
is probably the best workaround. (id -G prints numeric group IDs, which obviously can't contain spaces.)
It's probably also worth checking whether you actually have user or group names with spaces in them (though of course that doesn't guard against such names being added in the future). I wonder how your POSIX system actually deals with such names. I wouldn't be astonished if it automatically translates them somehow. Even so, your id -G solution will still work.
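Spelled out as a complete script, that workaround looks roughly like this (the default user name is just an example):
#!/usr/bin/perl
use strict;
use warnings;

my $username = shift // 'saml';              # example user name
chomp(my $gids = `id -G $username`);         # numeric group IDs, space-separated
my @groups = map { (getgrgid($_))[0] } split ' ', $gids;
print "$_\n" for @groups;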
If I have /etc/group:
...
postgres:x:26:
fsniper:x:481:
clamupdate:x:480:
some spacey group:x:482:saml, some spacey user
I can use the following commands to see this group's members:
% getent group
...
postgres:x:26:
fsniper:x:481:
clamupdate:x:480:
some spacey group:x:482:saml,some spacey user
Or if you know the specific group that you're interested in:
% getent group "some spacey group"
some spacey group:x:482:saml,some spacey user
These could be wrapped inside of a Perl script like this:
#!/usr/bin/perl
use feature qw(say);
chomp (my $getent = `getent group "some spacey group" | sed 's/.*://'`);
my @users = split(/,/, $getent);
foreach my $i (@users) { say $i; }
Running it:
% ./b.pl
saml
some spacey user
Resources
How to list all users in a Linux group?

One million records in a log file for an online shopping website: find distinct IP addresses [closed]

It is a technical test question. Given 1 million records in a log file. These are records of website hits of an online shopping website. The records are of type:
TimeStamp:Date,time ; IP address; ProductName
Find the distinct IP addresses and the most popular product. What is the most efficient way to do this? One solution is hashing. If the solution is hashing, please provide an explanation for efficiently hashing this, since there are one million records.
I recently did something similar for homework, not sure on the total number of lines, but it was quite a lot. The point is that your computer can probably do this very quickly, even if there is a million records.
I agree with you on the hashtable, and I would do the two questions slightly differently.
For the first, I would check every IP against the hashtable and, if it exists, do nothing. If it does not exist, add it to the hashtable and increment a counter. At the end of the program, the counter will tell you how many unique IPs there were.
For the second, I would hash the product name and put that in the hashtable, incrementing the value associated with the hash key every time I found a match in the table. At the end, loop through all the keys and values of the hashtable and find the highest value. That is the most popular product.
A million log records is really a very small number; just read them in and keep a set of the IP addresses and a dict from product names to number of mentions -- you don't mention any specific language constraint so I assume a language that will do (implicitly) excellent hashing of such strings on your behalf is acceptable (Perl, Python, Ruby, Java, C#, etc, all have fine facilities for the purpose).
E.g., in Python:
import collections
import heapq
ipset = set()
prodcount = collections.defaultdict(int)
numlines = 0
for line in open('logfile.txt', 'r'):
    timestamp, ip, product = line.strip().split(';')
    ipset.add(ip)
    prodcount[product] += 1
    numlines += 1
print "%d distinct IP in %d lines" % (len(ipset), numlines)
print "Top 10 products:"
top10 = heapq.nlargest(10, prodcount, key=prodcount.get)
for product in top10:
print "%6d %s" % (prodcount[product], product)
Distinct IP addresses:
$ cut -f 2 -d \; | sort | uniq
Most popular product:
$ cut -f 3 -d \; | sort | uniq -c | sort -n
If you can do so, shell script it like that.
First of all, one million lines is not at all a huge file.
A simple Perl script can chew through a 2.7 million line log file in 6 seconds, without having to think much about the algorithm.
In any case hashing is the way to go and, as shown, there's no need to bother with hashing over an integer representation.
If we were talking about a really huge file, then I/O would become the bottleneck and thus the hashing method gets less and less relevant as the file grows bigger.
Theoretically in a language like C it would probably be faster to hash over an integer than over a string, but I doubt that in a language suited to this task that would really make a difference. Things like how to read the file efficiently would matter much much more.
Code
vinko@parrot:~$ more hash.pl
use strict;
use warnings;
my %ip_hash;
my %product_hash;
open my $fh, "<", "log2.txt" or die $!;
while (<$fh>) {
my ($timestamp, $ip, $product) = split (/;/,$_); #To fix highlighting
$ip_hash{$ip} = 1 if (!defined $ip_hash{$ip});
if (!defined $product_hash{$product}) {
$product_hash{$product} = 1
} else {
$product_hash{$product} = $product_hash{$product} + 1;
}
}
for my $ip (keys %ip_hash) {
print "$ip\n";
}
my @pkeys = sort {$product_hash{$b} <=> $product_hash{$a}} keys %product_hash;
print "Top product: $pkeys[0]\n";
Sample
vinko@parrot:~$ wc -l log2.txt
2774720 log2.txt
vinko@parrot:~$ head -1 log2.txt
1;10.0.1.1;DuctTape
vinko@parrot:~$ time perl hash.pl
11.1.3.3
11.1.3.2
10.0.2.2
10.1.2.2
11.1.2.2
10.0.2.1
11.2.3.3
10.0.1.1
Top product: DuctTape
real 0m6.295s
user 0m6.230s
sys 0m0.030s
I would also read the file into a database, and link it to another table of log file names and import dates/times.
This is because in the real world you're going to need to do this regularly. The company is going to want to be able to check trends over time, so you're quickly going to be asked questions like "is that more or less unique IP addresses than last month's log file?" and "how are the most popular products changing from week to week".
In my experience the best way to answer these questions in an interview scenario as you describe is to show awareness of real-world situations. A tool to parse the log files (produced daily? weekly? monthly?) and read them into a database where some queries, graphs etc can pull all the data out, especially across multiple log files, will take a bit longer to write but be infinitely more useful and useable.
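As a rough illustration of that approach (not a production design), here is a Perl sketch using DBI with SQLite; the table layout and file name are made up for the example, and DBD::SQLite is assumed to be available:
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=hits.db', '', '', { RaiseError => 1 });
$dbh->do('CREATE TABLE IF NOT EXISTS hits (ts TEXT, ip TEXT, product TEXT, logfile TEXT)');

my $logfile = 'log.txt';    # example log file name
my $ins = $dbh->prepare('INSERT INTO hits (ts, ip, product, logfile) VALUES (?, ?, ?, ?)');

open my $fh, '<', $logfile or die $!;
while (<$fh>) {
    chomp;
    my ($ts, $ip, $product) = split /;/;
    $ins->execute($ts, $ip, $product, $logfile);
}
close $fh;

# The recurring questions then become one-line queries, across one or many imported files
my ($distinct) = $dbh->selectrow_array('SELECT COUNT(DISTINCT ip) FROM hits');
my ($top)      = $dbh->selectrow_array(
    'SELECT product FROM hits GROUP BY product ORDER BY COUNT(*) DESC LIMIT 1');
print "distinct IPs: $distinct\ntop product: $top\n";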
In the real world, if this is a once-off or occasional thing, I would simply insert the data into a database and run a few basic queries.
Since this is a homework assignment though, that's probably not what the prof is looking for. An IP address is really just a 32-bit number. I could convert each IP to its 32-bit equivalent instead of creating a hash.
Since this is homework, the rest "is left as an exercise to the reader."
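Just to illustrate the conversion step in Perl (the rest really is the exercise), the core Socket module does it in one line; whether hashing the integer actually beats hashing the dotted string is part of what you'd measure:
use Socket qw(inet_aton);

my $ip  = '10.0.1.1';                    # example address from a log line
my $key = unpack 'N', inet_aton($ip);    # 167772417 -- the 32-bit unsigned equivalent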
As other people have written, you just need two hashtables: one for IPs and one for products. You can count occurrences in both, but you only care about the counts in the latter, for "product popularity".
The key to hashing is having an efficient hash key, and hash efficiency means that the keys are evenly distributed. Poor key choice means that there will be many collisions, and performance of the hash table will suffer.
Being lazy, I'd be tempted to create a Dictionary<IPAddress,int> and hope that the implementor of the IPAddress class created the hash key appropriately.
Dictionary<IPAddress, int> ipAddresses = new Dictionary<IPAddress, int>();
Dictionary<string, int> products = new Dictionary<string,int>();
For the products list, after you've set up the hash table, just use LINQ to select them out.
var sorted = from o in products
orderby o.Value descending
where o.Value > 1
select new { ProductName = o.Key, Count = o.Value };
Just sort the file by each of the two fields of interest. That avoids any need to worry about hash functions and will work just fine on a million record set.
Sorting the IP addresses this way also makes it easy to extract other interesting information such as accesses from the same subnet.
Unique IPs:
$ awk -F\; '{print $2}' log.file | sort -u
count.awk
{ a[$0]++ }
END {
for(key in a) {
print a[key] " " key;
}
}
Top 10 favorite items:
$ awk -F\; '{print $3}' log.file | awk -f count.awk | sort -rn | head -10