How to hash a filename down to a small number or digit for output processing - perl

I am not a Perl programmer, but I've inherited existing code that goes to a directory, finds all files in that folder and its subfolders (usually JPG or Office files), and then converts them into a single file that is loaded into a SQL Server database. The customer has about 500,000 of these files.
It takes about 45 mins to create the file and then another 45 mins for SQL to load the data. Crudely, it's doing about 150 per second which is reasonable but time is the issue for the job. There are many reasons I don't want to use other techniques so please don't suggest other options unless closely aligned to this process.
What I was considering is to improve speed by running something like 10 processes concurrently. Each process would be passed an extra argument (0-9). Each process would go to the directory and find all files as it currently does, but for each file found it would hash or kludge the filename down to a single digit (0-9), and if that matched the supplied argument, the process would handle that file and write it out to its own file stream.
Then I would have 10 output files at the end. I doubt that the SQL Server side could be improved, as I would have to load into separate tables and then merge them in the database, and as these are BLOB objects, that will not be fast.
So I am looking for some basic code or clues on what functions to use in Perl to take a variable (the file name $File) and generate a single 0 to 9 value based on that. It is probably done by getting the ASCII value of each char, adding these together to get a long number, then adding the individual digits of that number together, and so on until a single digit is left.
Any clues or suggested techniques?

Here's an easy one to implement suggested in the unpack function documentation:
sub string_to_code {
    # convert an arbitrary string to a digit from 0-9
    my ($string) = @_;
    return unpack("%32W*", $string) % 10;
}
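As a rough sketch of how that digit could drive the 10-process split described in the question: each worker keeps only the names whose digit equals its own argument. The helper `files_for_worker` and the sample file names are hypothetical and only for illustration; the real script would apply the same test to each file it finds while walking the directory.

```perl
use strict;
use warnings;

# Map an arbitrary string to a digit from 0-9
# (32-bit sum of the character codes, modulo 10).
sub string_to_code {
    my ($string) = @_;
    return unpack("%32W*", $string) % 10;
}

# Hypothetical helper: keep only the names that belong to one worker (0-9).
sub files_for_worker {
    my ($worker, @names) = @_;
    return grep { string_to_code($_) == $worker } @names;
}

# Made-up file names, for illustration only.
my @names = ('a.jpg', 'b.jpg', 'report.docx', 'photo_0001.jpg');
for my $worker (0 .. 9) {
    my @mine = files_for_worker($worker, @names);
    print "worker $worker: @mine\n" if @mine;
}
```

Because every name maps to exactly one digit, each file is picked up by exactly one of the 10 processes, with no coordination between them.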

Related

How to fix out of memory error when comparing 5 million records in two files using Perl in windows environment

I'm comparing two files of 5 million records each (each line contains many columns, but I need to compare only 2 of them). Is there a better approach to compare the two files and find the differences without an out-of-memory error?
I have tried parsing each file into its own hash and comparing the two hashes, but that leads to an out-of-memory error.
The first question is, do you need to be using Perl to begin with?
Have you thought about using standard Linux utilities?
Depending on how your columns of data are constructed and delimited, there is a very good chance that Linux 'cut' could work for you to extract from each file only the column you need into a temp file.
Then use Linux 'sort' to sort each temp file.
Then use Linux 'diff' or 'comm' to compare the two temp files.
None of the above-suggested utilities should have any out-of-memory issues even on two files of 5 million records, assuming you have a reasonable amount of memory and disk space (e.g., for 'sort' to create its own temporary files).
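If the final comparison has to stay in Perl, the same idea can be sketched as a merge-style walk over the two sorted temp files, roughly what 'comm' does. This is only an illustration under assumptions: both inputs are already sorted in Perl's string order (for example with `LC_ALL=C sort`), and `compare_sorted` and `next_key` are hypothetical names, not part of any module.

```perl
use strict;
use warnings;

# Read one line and strip the newline; returns undef at end of input.
sub next_key {
    my ($fh) = @_;
    my $line = <$fh>;
    chomp $line if defined $line;
    return $line;
}

# Walk two sorted inputs in step and report keys present in only one of them.
# Only one line per input is held in memory at any time.
sub compare_sorted {
    my ($fh_a, $fh_b, $report) = @_;
    my $key_a = next_key($fh_a);
    my $key_b = next_key($fh_b);
    while (defined $key_a || defined $key_b) {
        my $order = !defined $key_a ? 1
                  : !defined $key_b ? -1
                  :                   $key_a cmp $key_b;
        if ($order < 0) {
            $report->('only in A', $key_a);
            $key_a = next_key($fh_a);
        }
        elsif ($order > 0) {
            $report->('only in B', $key_b);
            $key_b = next_key($fh_b);
        }
        else {
            $key_a = next_key($fh_a);
            $key_b = next_key($fh_b);
        }
    }
}

# Demo on in-memory data standing in for the two sorted temp files.
open my $demo_a, '<', \"k1\nk2\nk4\n" or die "Cannot open in-memory file: $!";
open my $demo_b, '<', \"k2\nk3\nk4\n" or die "Cannot open in-memory file: $!";
compare_sorted($demo_a, $demo_b, sub { print "$_[0]: $_[1]\n" });
```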

indexing of large text files line by line for fast access

I have a very large text file (around 43GB) which I process to generate other files in different forms, and I don't want to set up any databases or indexing search engines.
The data is in the .ttl format:
<http://www.wikidata.org/entity/Q1000> <http://www.w3.org/2002/07/owl#sameAs> <http://nl.dbpedia.org/resource/Gabon> .
<http://www.wikidata.org/entity/Q1000> <http://www.w3.org/2002/07/owl#sameAs> <http://en.dbpedia.org/resource/Gabon> .
<http://www.wikidata.org/entity/Q1001> <http://www.w3.org/2002/07/owl#sameAs> <http://lad.dbpedia.org/resource/Mohandas_Gandhi> .
<http://www.wikidata.org/entity/Q1001> <http://www.w3.org/2002/07/owl#sameAs> <http://lb.dbpedia.org/resource/Mohandas_Karamchand_Gandhi> .
The target is to generate all combinations from all triples that share the same subject.
For example, for the subject Q1000:
<http://nl.dbpedia.org/resource/Gabon> <http://www.w3.org/2002/07/owl#sameAs> <http://en.dbpedia.org/resource/Gabon> .
<http://en.dbpedia.org/resource/Gabon> <http://www.w3.org/2002/07/owl#sameAs> <http://nl.dbpedia.org/resource/Gabon> .
The problem:
The dummy code I started with iterates with complexity O(n^2), where n is the number of lines of the 43GB text file; needless to say, it would take years to finish.
What I thought of to optimize:
Loading a HashMap [String,IntArray] that indexes the line numbers where each key appears, and using any library to access the file by line number, for example:
Q1000 | 1,2,433
Q1001 | 2334,323,2124
Drawbacks: the index could be relatively large as well, considering that we will need another index for access by specific line number, plus the overloaded I didn't try the performance of the
Making a text file for each key, like Q1000.txt, holding all triples that contain subject Q1000, then iterating over the files one by one and making the combinations.
Drawbacks: this seems the fastest and least memory-consuming option, but creating around 10 million files and accessing them will certainly be a problem. Is there an alternative to that?
I'm using Scala scripts for the task.
Take the 43GB file in chunks that fit comfortably in memory and sort on the subject. Write the chunks separately.
Run a merge sort on the chunks (sorted by subject). It's really easy: you have as input iterators over two files, and you write out whichever input is less, then read from that one again (if there's any left).
Now you just need to make one pass through the sorted data to gather the groups of subjects.
Should take O(n) space and O(n log n) time, which for this sort of thing you should be able to afford.
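The final grouping pass can be sketched as follows, assuming the input has already been sorted by subject. It is written in Perl here rather than Scala, purely as an illustration; `combine_by_subject` and `emit_pairs` are hypothetical names, and the sketch assumes every line is a simple three-IRI triple and that all triples of a subject share one predicate (as in the owl#sameAs sample).

```perl
use strict;
use warnings;

# Print every ordered pair of distinct objects collected for one subject.
sub emit_pairs {
    my ($out_fh, $predicate, @objects) = @_;
    for my $from (@objects) {
        for my $to (@objects) {
            next if $from eq $to;
            print {$out_fh} "$from $predicate $to .\n";
        }
    }
}

# One pass over input sorted by subject: collect the objects of the current
# subject, and flush them as pairs whenever the subject changes.
sub combine_by_subject {
    my ($in_fh, $out_fh) = @_;
    my ($current, $predicate, @objects);
    while (my $line = <$in_fh>) {
        my ($s, $p, $o) = $line =~ /^(<[^>]*>)\s+(<[^>]*>)\s+(<[^>]*>)\s*\.\s*$/
            or next;    # skip anything that is not a plain three-IRI triple
        if (defined $current && $s ne $current) {
            emit_pairs($out_fh, $predicate, @objects);
            @objects = ();
        }
        ($current, $predicate) = ($s, $p);
        push @objects, $o;
    }
    emit_pairs($out_fh, $predicate, @objects) if @objects;
}

# Demo on the two Q1000 lines from the question.
my $sample = join '',
    "<http://www.wikidata.org/entity/Q1000> <http://www.w3.org/2002/07/owl#sameAs> <http://nl.dbpedia.org/resource/Gabon> .\n",
    "<http://www.wikidata.org/entity/Q1000> <http://www.w3.org/2002/07/owl#sameAs> <http://en.dbpedia.org/resource/Gabon> .\n";
open my $sample_fh, '<', \$sample or die "Cannot open in-memory file: $!";
combine_by_subject($sample_fh, \*STDOUT);
```

Only the objects of one subject are held in memory at a time, so memory use depends on the largest group, not on the size of the file.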
A possible solution would be to use some existing map-reduce library. After all, your task is exactly what map-reduce is for. Even if you don't parallelize your computation on multiple machines, the main advantage is that it handles the management of splitting and merging for you.
There is an interesting library, Apache Crunch, with a Scala API. I haven't used it myself, but it looks like it could solve your problem well. Your lines would be split according to their subjects and then

How can I efficiently group a large list of URLs by their host name in Perl?

I have a text file that contains over one million URLs. I have to process this file in order to assign URLs to groups, based on host address:
{
'http://www.ex1.com' => ['http://www.ex1.com/...', 'http://www.ex1.com/...', ...],
'http://www.ex2.com' => ['http://www.ex2.com/...', 'http://www.ex2.com/...', ...]
}
My current basic solution takes about 600 MB of RAM to do this (size of file is about 300 MB). Could you provide some more efficient ways?
My current solution simply reads line by line, extracts host address by regex and puts the url into a hash.
EDIT
Here is my implementation (I've cut off irrelevant things):
while ($line = <STDIN>) {
    chomp($line);
    $line =~ /(http:\/\/.+?)(\/|$)/i;
    $host = "$1";
    push @{$urls{$host}}, $line;
}
# store() comes from the Storable module
store \%urls, 'out.hash';
One approach that you could take is tying your url hash to a DBM like BerkeleyDB. You can explicitly give it options for how much memory it can use.
If you read 600MB from two files and store them in memory (in the hash) there is not much room for optimization in terms of memory use (short of compressing the data, which is probably not a viable option).
But depending on how you are going to use the data in the hash, it might be worth considering storing the data in a database, and querying it for the information you need.
EDIT:
Based on the code you have posted, a quick optimization would be to not store the entire line but just the relative url. After all you already have the host name as a key in your hash.
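A minimal sketch of that optimization, assuming the URLs all start with http:// as in the posted regex. The helper `split_url` and the sample URLs are made up for illustration.

```perl
use strict;
use warnings;

# Split a URL into its host prefix and the remainder (path, query, ...).
# Returns an empty list when the line does not look like an http URL.
sub split_url {
    my ($url) = @_;
    return $url =~ m{^(http://[^/]+)(.*)$}i ? ($1, $2) : ();
}

my %urls;
for my $line ('http://www.ex1.com/a/b', 'http://www.ex1.com/c', 'http://www.ex2.com') {
    my ($host, $rest) = split_url($line) or next;
    # Store only the part after the host; the host is already the hash key.
    push @{ $urls{$host} }, $rest;
}

for my $host (sort keys %urls) {
    print "$host => [", join(', ', map { "'$_'" } @{ $urls{$host} }), "]\n";
}
```

The full URL can be rebuilt at any time by concatenating the key and the stored remainder, so no information is lost.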
Other than by storing your data structures to disk (tied DBM hash as suggested by Leon Timmermans, an SQL database such as SQLite3, etc.), you're not going to be able to reduce memory consumption much. 300M of actual data, plus the perl interpreter, plus the bytecode representation of your program, plus metadata on each of the extracted strings is going to add up to substantially more than 300M of total memory used if you keep it all in memory. If anything, I'm mildly surprised that it's only double the size of the input file.
One other thing to consider is that, if you're going to be processing the same file more than once, storing the parsed data structure on disk means that you'll never have to take the time to re-parse it on future runs of the program.
What exactly are you trying to achieve? If you are going for some complex analysis, storing to a database is a good idea. If the grouping is just an intermediary step, you might just sort the text file and then process it sequentially, directly deriving the results you are looking for.

How can I search a large sorted file in Perl?

Can you suggest any CPAN modules for searching a large sorted file?
The file is structured data of about 15 million to 20 million lines, but I just need to find about 25,000 matching entries, so I don't want to load the whole file into a hash.
Thanks.
Perl is well-suited to doing this, without the need for an external module (from CPAN or elsewhere).
Some code:
while (<STDIN>) {
    if (/regular expression/) {
        # process each matched line here
    }
}
You'll need to come up with your own regular expression to specify which lines you want to match in your file. Once you match, you need your own code to process each matched line.
Put the above code in a script file and run it with your file redirected to stdin.
A scan over the whole file may be the fastest way. You can also try File::Sorted, which will do a binary search for a given record. Locating one record in a 20 million line file should require about 25 seeks (log2 of 20 million is a little over 24). This means that to search for 25,000 records, you would only need around 0.6 million seeks/comparisons, compared to up to 20 million to naively examine each row.
Disk IO being what it is, you may want to try the easy way first, but File::Sorted is a theoretical win.
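As one way to try the binary-search idea, here is a sketch that uses Search::Dict, a module that ships with Perl, instead of File::Sorted. It assumes tab-separated records with the key in the first column and a file sorted in plain byte order; `find_record` and the sample data are hypothetical.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use Search::Dict;    # core module: look() binary-searches a sorted text file

# Return the first line whose first tab-separated field equals $key,
# or nothing if there is no such line.
sub find_record {
    my ($fh, $key) = @_;
    # look() positions $fh at the first line that is ge $key (string order).
    return if look($fh, $key) < 0;
    my $line = <$fh>;
    return unless defined $line;
    chomp $line;
    return index($line, "$key\t") == 0 ? $line : ();
}

# Demo on a small temporary file standing in for the large sorted file.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
print {$tmp_fh} "apple\t1\nbanana\t2\ncherry\t3\n";
close $tmp_fh or die "Cannot close $tmp_path: $!";

open my $sorted_fh, '<', $tmp_path or die "Cannot open $tmp_path: $!";
for my $key (qw(banana blueberry)) {
    my $hit = find_record($sorted_fh, $key);
    print defined $hit ? "$key: $hit\n" : "$key: not found\n";
}
close $sorted_fh;
```

The file handle can be reused for all 25,000 lookups, since look() seeks to a fresh position on every call.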
You don't want to search the file, so do what you can to avoid it. We don't know much about your problem, but here are some tricks I've used in previous problems, all of which try to do work ahead of time:
Break up the file into a database. That could be SQLite, even.
Pre-index the file based on the data that you want to search.
Cache the results from previous searches.
Run common searches ahead of time, automatically.
All of these trade storage space for speed. Some of these I would set up as overnight jobs so they were ready for people when they came into work.
You mention that you have structured data, but don't say any more. Is each line a complete record? How often does this file change?
Sounds like you really want a database. Consider SQLite, using Perl's DBI and DBD::SQLite modules.
When you process an input file with while ( <$filehandle> ), it only takes the file one line at a time (for each iteration of the loop), so you don't need to worry about it clogging up your memory. Not so with a for loop, which slurps the whole file into memory. Use a regex or whatever else to find what you're looking for and put that in a variable/array/hash or write it out to a new file.

How do you deal with lots of small files?

A product that I am working on collects several thousand readings a day and stores them as 64k binary files on an NTFS partition (Windows XP). After a year in production there are over 300,000 files in a single directory and the number keeps growing. This has made accessing the parent/ancestor directories from Windows Explorer very time consuming.
I have tried turning off the indexing service but that made no difference. I have also contemplated moving the file content into a database/zip files/tarballs but it is beneficial for us to access the files individually; basically, the files are still needed for research purposes and the researchers are not willing to deal with anything else.
Is there a way to optimize NTFS or Windows so that it can work with all these small files?
NTFS actually will perform fine with many more than 10,000 files in a directory as long as you tell it to stop creating alternative file names compatible with 16 bit Windows platforms. By default NTFS automatically creates an '8 dot 3' file name for every file that is created. This becomes a problem when there are many files in a directory because Windows looks at the files in the directory to make sure the name they are creating isn't already in use. You can disable '8 dot 3' naming by setting the NtfsDisable8dot3NameCreation registry value to 1. The value is found in the HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\FileSystem registry path. It is safe to make this change as '8 dot 3' name files are only required by programs written for very old versions of Windows.
A reboot is required before this setting will take effect.
NTFS performance severely degrades after 10,000 files in a directory. What you do is create an additional level in the directory hierarchy, with each subdirectory having 10,000 files.
For what it's worth, this is the approach that the SVN folks took in version 1.5. They used 1,000 files as the default threshold.
The performance issue is being caused by the huge number of files in a single directory: once you eliminate that, you should be fine. This isn't an NTFS-specific problem: in fact, it's commonly encountered with user home/mail files on large UNIX systems.
One obvious way to resolve this issue, is moving the files to folders with a name based on the file name. Assuming all your files have file names of similar length, e.g. ABCDEFGHI.db, ABCEFGHIJ.db, etc, create a directory structure like this:
ABC\
    DEF\
        ABCDEFGHI.db
    EFG\
        ABCEFGHIJ.db
Using this structure, you can quickly locate a file based on its name. If the file names have variable lengths, pick a maximum length, and prepend zeroes (or any other character) in order to determine the directory the file belongs in.
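A small sketch of how the target path could be computed from the file name under this scheme. The function name `bucket_path` and the default maximum length of 9 characters are assumptions made for the illustration; shorter names are left-padded with zeroes as described above.

```perl
use strict;
use warnings;

# Build "XXX\YYY\filename" from the first six characters of the
# (zero-padded) name before the extension.
sub bucket_path {
    my ($file, $width) = @_;
    $width //= 9;    # assumed maximum length of the name before the extension
    my ($stem) = $file =~ /^([^.]+)/
        or die "Cannot find a name before the extension in '$file'\n";
    die "'$stem' is longer than $width characters\n" if length($stem) > $width;
    my $padded = ('0' x ($width - length $stem)) . $stem;
    return join '\\', substr($padded, 0, 3), substr($padded, 3, 3), $file;
}

print bucket_path($_), "\n" for qw(ABCDEFGHI.db ABCEFGHIJ.db 42.db);
```

Because the path depends only on the name, the same function serves both for storing a file and for finding it again later.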
I have seen vast improvements in the past from splitting the files up into a nested hierarchy of directories by, e.g., first then second letter of filename; then each directory does not contain an excessive number of files. Manipulating the whole database is still slow, however.
I have run into this problem lots of times in the past. We tried storing by date, zipping files below the date so you don't have lots of small files, etc. All of them were bandaids to the real problem of storing the data as lots of small files on NTFS.
You can go to ZFS or some other file system that handles small files better, but still stop and ask if you NEED to store the small files.
In our case we eventually went to a system where all of the small files for a certain date were appended in a TAR type of fashion with simple delimiters to parse them. The disk files went from 1.2 million to under a few thousand. They actually loaded faster because NTFS can't handle the small files very well, and the drive was better able to cache a 1MB file anyway. In our case the access and parse time to find the right part of the file was minimal compared to the actual storage and maintenance of stored files.
You could try using something like Solid File System.
This gives you a virtual file system that applications can mount as if it were a physical disk. Your application sees lots of small files, but just one file sits on your hard drive.
http://www.eldos.com/solfsdrv/
If you can calculate the names of files, you might be able to sort them into folders by date, so that each folder only has files for a particular date. You might also want to create month and year hierarchies.
Also, could you move files older than say, a year, to a different (but still accessible) location?
Finally, and again, this requires you to be able to calculate names, you'll find that directly accessing a file is much faster than trying to open it via explorer. For example, saying
notepad.exe "P:\ath\to\your\filen.ame"
from the command line should actually be pretty quick, assuming you know the path of the file you need without having to get a directory listing.
One common trick is to simply create a handful of subdirectories and divvy up the files.
For instance, Doxygen, an automated code documentation program which can produce tons of html pages, has an option for creating a two-level deep directory hierarchy. The files are then evenly distributed across the bottom directories.
Aside from placing the files in sub-directories..
Personally, I would develop an application that keeps the interface to that folder the same, i.e. all files are still displayed as individual files. In the background, the application would take these files and combine them into larger files (and since the sizes are always 64k, getting the data you need should be relatively easy), which gets rid of the mess you have.
This still makes it easy for them to access the files they want, but also gives you more control over how everything is structured.
Having hundreds of thousands of files in a single directory will indeed cripple NTFS, and there is not really much you can do about that. You should reconsider storing the data in a more practical format, like one big tarball or in a database.
If you really need a separate file for each reading, you should sort them into several sub directories instead of having all of them in the same directory. You can do this by creating a hierarchy of directories and put the files in different ones depending on the file name. This way you can still store and load your files knowing just the file name.
The method we use is to take the last few letters of the file name, reverse them, and create one-letter directories from that. Consider the following files for example:
1.xml
24.xml
12331.xml
2304252.xml
you can sort them into directories like so:
data/1.xml
data/24.xml
data/1/3/3/12331.xml
data/2/5/2/4/0/2304252.xml
This scheme will ensure that you will never have more than 100 files in each directory.
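The exact rule is not spelled out, but the four examples are consistent with this reading: everything after the first two characters of the name is reversed, and each of those characters becomes one directory level. The sketch below implements that inferred rule; `reversed_digit_path` is a hypothetical name.

```perl
use strict;
use warnings;

# Inferred from the examples: characters after the first two are reversed
# and turned into one-letter directories under data/.
sub reversed_digit_path {
    my ($file) = @_;
    my ($stem) = $file =~ /^([^.]+)/
        or die "Cannot find a name before the extension in '$file'\n";
    my $tail = length($stem) > 2 ? substr($stem, 2) : '';
    my @dirs = reverse split //, $tail;
    return join '/', 'data', @dirs, $file;
}

print reversed_digit_path($_), "\n" for qw(1.xml 24.xml 12331.xml 2304252.xml);
```

Under this reading, the files in one directory differ only in their first two characters, which is where the limit of 100 files per directory comes from when the names are numeric.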
Consider pushing them to another server that uses a filesystem friendlier to massive quantities of small files (Solaris w/ZFS for example)?
If there are any meaningful, categorical, aspects of the data you could nest them in a directory tree. I believe the slowdown is due to the number of files in one directory, not the sheer number of files itself.
The most obvious, general grouping is by date, and gives you a three-tiered nesting structure (year, month, day) with a relatively safe bound on the number of files in each leaf directory (1-3k).
Even if you are able to improve the filesystem/file browser performance, it sounds like this is a problem you will run into in another 2 years, or 3 years... just looking at a list of 0.3-1mil files is going to incur a cost, so it may be better in the long-term to find ways to only look at smaller subsets of the files.
Using tools like 'find' (under cygwin, or mingw) can make the presence of the subdirectory tree a non-issue when browsing files.
Rename the folder each day with a time stamp.
If the application is saving the files into c:\Readings, then set up a scheduled task to rename Readings at midnight and create a new empty folder.
Then you will get one folder for each day, each containing several thousand files.
You can extend the method further to group by month. For example, c:\Readings becomes c:\Archive\September\22.
You have to be careful with your timing to ensure you are not trying to rename the folder while the product is saving to it.
To create a folder structure that will scale to a large unknown number of files, I like the following system:
Split the filename into fixed length pieces, and then create nested folders for each piece except the last.
The advantage of this system is that the depth of the folder structure only grows with the length of the filename. So if your files are automatically generated in a numeric sequence, the structure is only as deep as it needs to be.
12.jpg -> 12.jpg
123.jpg -> 12\123.jpg
123456.jpg -> 12\34\123456.jpg
This approach does mean that folders contain files and sub-folders, but I think it's a reasonable trade off.
And here's a beautiful PowerShell one-liner to get you going!
$s = '123456'
-join (( $s -replace '(..)(?!$)', '$1\' -replace '[^\\]*$','' ), $s )