How can I efficiently group a large list of URLs by their host name in Perl? - perl

I have text file that contains over one million URLs. I have to process this file in order to assign URLs to groups, based on host address:
{
'http://www.ex1.com' => ['http://www.ex1.com/...', 'http://www.ex1.com/...', ...],
'http://www.ex2.com' => ['http://www.ex2.com/...', 'http://www.ex2.com/...', ...]
}
My current basic solution takes about 600 MB of RAM to do this (size of file is about 300 MB). Could you provide some more efficient ways?
My current solution simply reads line by line, extracts host address by regex and puts the url into a hash.
EDIT
Here is my implementation (I've cut off irrelevant things):
while($line = <STDIN>) {
chomp($line);
$line =~ /(http:\/\/.+?)(\/|$)/i;
$host = "$1";
push #{$urls{$host}}, $line;
}
store \%urls, 'out.hash';

One approach that you could take is tieing your url hash to a DBM like BerkeleyDB. You can explicitly give it options for how much memory it can use.

If you read 600MB from two files and store them in memory (in the hash) there is not much room for optimization in terms of memory use (short of compressing the data, which is probably not a viable option).
But depending on how you are going to use the data in the hash, it might be worth to consider storing the data in a database, and querying it for the information you need.
EDIT:
Based on the code you have posted, a quick optimization would be to not store the entire line but just the relative url. After all you already have the host name as a key in your hash.

Other than by storing your data structures to disk (tied DBM hash as suggested by Leon Timmermans, an SQL database such as SQLite3, etc.), you're not going to be able to reduce memory consumption much. 300M of actual data, plus the perl interpreter, plus the bytecode representation of your program, plus metadata on each of the extracted strings is going to add up to substantially more than 300M of total memory used if you keep it all in memory. If anything, I'm mildly surprised that it's only double the size of the input file.
One other thing to consider is that, if you're going to be processing the same file more than once, storing the parsed data structure on disk means that you'll never have to take the time to re-parse it on future runs of the program.

What exactly you are trying to acheive? If you are going for some complex analysis, storing to database is a good idea, of the grouping is just and intermediary step, you might just sort the text file and than process it sequentially directly deriving the results you are looking for.

Related

Serialize data to binary using multicore

I'm using the store function from the module Storable to get a binary representation of my hash. This hash is big enough for make the process last for 20min. Are there any similar function to store that works with multicore, so it gets the speed boosted?
I've searching for a while and I coulnd't find anything relevant, even using bson for the storage.
Finally I decided to split the data that I want to store in as many pieces as cores I have on the computer. So, I'm able to execute the store in threads making different output files, as ikegami suggested in the comments.

Perl DBM vs. Storable

for my current project i need to store a little database on disk, that i read once my program runs and write it once.
I have looked into perls DBM functionality and from what I understand it provides merely a hash that is stored on disk with every read and write going directly to disk.
My question is: Could I not simply use Storable or any of the related modules to achieve the same (a persistent hash) with far less File I/O overhead? (The hashes will never be to large to fit into memory easily)
Regards
Nick
SQLite is fast becoming the standard for simple on-disk databases. And in Perl you can just use DBD::SQLite and you're good to go.
Since the previous answers didn't really answer your actual question, "yes, you can"... with the following caveats:
Storable isn't really suited to concurrent access.
You will need to roll your own "atomic" update (ie: you will need to write to a tmp file, then rename).
If performance isn't really an issue, you could also use Data::Dumper (with the resulting file being somewhat human readable).
You could splat the contents to CSV.
I often use Dumper when there is only going to be a single task accessing the file - and it gives me a way to read/modify the contents if I see fit.

Using combination of File Size and Hash value of only first 20KB of a file to detect duplicates?

A project I'm working on requires detection of duplicate files. Under normal circumstances I would simply compare the file bytes in blocks or hash value of the entire file contents. However, the system does not have access to the entire file - only the first 50KB or so. It also knows the total file size of the original file.
I was thinking of implementing the following: each time a file is added, I would look for possible duplicates using both the total file size and a hash calculation of (file-size)+(first-20KB-of-file). The hash algorithm itself is not the issue at this stage, but will likely be MurmurHash2.
Another option is to also store, say, bytes 9000 through 9020 and use that as a third condition when looking up a duplicate copy or alternatively to compare byte-to-byte when the aforementioned lookup method returns possible duplicates in a last attempt to discard false positives.
How naive is my proposed implementation? Is there a reliable way to predict the amount of false positives? What other considerations should I be aware of?
Edit: I forgot to mention that the files are generally going to be compressed archives (ZIP,RAR) and on occasion JPG images.
You can use file size, hashes and partial-contents to quickly detect when two files are different, but you can only determine if they are exactly the same by comparing their contents in full.
It's up to you to decide whether the failure rate of your partial-file-check will be low enough to be acceptable in your specific circumstances. Bearing in mind that even an "exceedingly unlikely" event will happen frequently if you have enough volume. But if you know the type of data that the files will contain, you can judge the chances of two near-identical files (idenitcal in the first 50kB) cropping up.
I would think that if a partial-file-match is acceptable to you, then a hash of those partial file contents is probably going to be pretty acceptable too.
If you have access to 50kB then I'd use all 50kB rather than only the first 20kB in my hash.
Picking an arbitrary 20 bytes probably won't help much (your file contents will either be very different in which case hash+size clashes will be unlikely, or they will be very similar in which case the chances of a randomly chosen 20 bytes being different will be quite low)
In your circumstances I would check the size, then a hash of the available daa (50kB), then if this suggests a file match, a brute-force comparison of the available data just to minimise the risks, if you don't expect to be adding so many duplicates that this would bog the system down.
It depends on the file types, but in most cases false positives will be pretty rare.
You probably won't have any in Office and graphical files. And executables are supposed have a checksum in the header.
I'd say that the most likely false positive you may encounter is in source code files. They change often and it may happen that a programmer replaces a few symbols something after the first 20K.
Other than that I'd say they are pretty unlikely.
Why don't use a hash of the first 50 KB, and then store the size on the side? That would give you the most security with what you have to work with (with that said, there could be totally different content in the files after the first 50 KB without you knowing, so it's not a really secure system).
I find it difficult. It's likely that you would catch most duplicates with this method, but the possibility of false positives is huge. What about two versions of a 5MB XML document whose last chapter is modified?

How can I search a large sorted file in Perl?

Can you suggest me any CPAN modules to search on a large sorted file?
The file is a structured data about 15 million to 20 million lines, but I just need to find about 25,000 matching entries so I don't want to load the whole file into a hash.
Thanks.
Perl is well-suited to doing this, without the need for an external module (from CPAN or elsewhere).
Some code:
while (<STDIN>) {
if (/regular expression/) {
process each matched line
}
}
You'll need to come up with your own regular expression to specify which lines you want to match in your file. Once you match, you need your own code to process each matched line.
Put the above code in a script file and run it with your file redirected to stdin.
A scan over the whole file may be the fastest way. You can also try File::Sorted, which will do a binary search for a given record. Locating one record in a 25 million line file should require about 15-20 seeks for each record. This means that to search for 25,000 records, you would only need around .5 million seeks/comparison, compared to 25,000,000 to naively examine each row.
Disk IO being what it is, you may want to try the easy way first, but File::Sorted is a theoretical win.
You don't want to search the file, so do what you can to avoid it. We don't know much about your problem, but here are some tricks I've used in previous problems, all of which try to do work ahead of time:
Break up the file into a database. That could be SQLite, even.
Pre-index the file based on the data that you want to search.
Cache the results from previous searches.
Run common searches ahead of time, automatically.
All of these trade storage space to for speed. Some some these I would set up as overnight jobs so they were ready for people when they came into work.
You mention that you have structured data, but don't say any more. Is each line a complete record? How often does this file change?
Sounds like you really want a database. Consider SQLite, using Perl's DBI and DBD::SQLite modules.
When you process an input file with while ( <$filehandle> ), it only takes the file one line at a time (for each iteration of the loop), so you don't need to worry about it clogging up your memory. Not so with a for loop, which slurps the whole file into memory. Use a regex or whatever else to find what you're looking for and put that in a variable/array/hash or write it out to a new file.

Optimizing word count

(This is rather hypothetical in nature as of right now, so I don't have too many details to offer.)
I have a flat file of random (English) words, one on each line. I need to write an efficient program to count the number of occurrences of each word. The file is big (perhaps about 1GB), but I have plenty of RAM for everything. They're stored on permanent media, so read speeds are slow, so I need to just read through it once linearly.
My two off-the-top-of-my-head ideas were to use a hash with words => no. of occurrences, or a trie with the no. of occurrences at the end node. I have enough RAM for a hash array, but I'm thinking that a trie would have as fast or faster lookups.
What approach would be best?
I think a trie with the count as the leaves could be faster.
Any decent hash table implementation will require reading the word fully, processing it using a hash function, and finally, a look-up in the table.
A trie can be implemented such that the search occurs as you are reading the word. This way, rather than doing a full look-up of the word, you could often find yourself skipping characters once you've established the unique word prefix.
For example, if you've read the characters: "torto", a trie would know that the only possible word that starts this way is tortoise.
If you can perform this inline searching faster on a word faster than the hashing algorithm can hash, you should be able to be faster.
However, this is total overkill. I rambled on since you said it was purely hypothetical, I figured you'd like a hypothetical-type of answer. Go with the most maintainable solution that performs the task in a reasonable amount of time. Micro-optimizations typically waste more time in man-hours than they save in CPU-hours.
I'd use a Dictionary object where the key is word converted to lower case and the value is the count. If the dictionary doesn't contain the word, add it with a value of 1. If it does contain the word, increment the value.
Given slow reading, it's probably not going to make any noticeable difference. The overall time will be completely dominated by the time to read the data anyway, so that's what you should work at optimizing. For the algorithm (mostly data structure, really) in memory, just use whatever happens to be most convenient in the language you find most comfortable.
A hash table is (if done right, and you said you had lots of RAM) O(1) to count a particular word, while a trie is going to be O(n) where n is the length of the word.
With a sufficiently large hash space, you'll get much better performance from a hash table than from a trie.
I think that a trie is overkill for your use case. A hash of word => # of occurrences is exactly what I would use. Even using a slow interpreted language like Perl, you can munge a 1GB file this way in just a few minutes. (I've done this before.)
I have enough RAM for a hash array, but I'm thinking that a trie would have as fast or faster lookups.
How many times will this code be run? If you're just doing it once, I'd say optimize for your time rather than your CPU's time, and just do whatever's fastest to implement (within reason). If you have a standard library function that implements a key-value interface, just use that.
If you're doing it many times, then grab a subset (or several subsets) of the data file, and benchmark your options. Without knowing more about your data set, it'd be dubious to recommend one over another.
Use Python!
Add these elements to a set data type as you go line by line, before asking whether it is in the hash table. After you know it is in the set, then add a dictionary value of 2, since you already added it to the set once before.
This will take some of the memory and computation away from asking the dictionary every single time, and instead will handle unique valued words better, at the end of the call just dump all the words that are not in the dictionary out of the set with a value of 1. (Intersect the two collections in respect to the set)
To a large extent, it depends on what you want you want to do with the data once you've captured it. See Why Use a Hash Table over a Trie (Prefix Tree)?
a simple python script:
import collections
f = file('words.txt')
counts = collections.defaultdict(int)
for line in f:
counts[line.strip()] +=1
print "\n".join("%s: %d" % (word, count) for (word, count) in counts.iteritems())