I'm trying to compare two files based on variables captured from the first input file. If I have $inputvariable and I want to quickly check it against the contents of the second input file, how do I do that? I've tried using Get-Content, but the second file has roughly 500,000 entries. The first file has around 1,000, so even with breaking out of the loop early, checking each of those variables against the 500,000-line file is slow. Essentially, is there a way to index the second input file as if it were loaded into a database and search it quickly?
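One approach, shown here as a minimal PowerShell sketch (the file paths, and the assumption that each line of the second file is exactly the value being compared, are mine), is to load the large file once into a HashSet, which gives constant-time lookups much like a database index:

    $lookup = [System.Collections.Generic.HashSet[string]]::new()
    foreach ($line in [System.IO.File]::ReadLines('C:\data\bigfile.txt')) {
        [void]$lookup.Add($line)          # index every line of the 500,000-entry file once
    }

    foreach ($inputvariable in Get-Content 'C:\data\smallfile.txt') {
        if ($lookup.Contains($inputvariable)) {
            "$inputvariable found in second file"
        }
    }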
I am not a Perl programmer, but I've inherited existing code that goes to a directory, finds all files in that folder and its subfolders (usually JPG or Office files), and then combines them into a single file that is loaded into a SQL Server database. The customer has about 500,000 of these files.
It takes about 45 minutes to create the file and then another 45 minutes for SQL to load the data. Crudely, that's about 150 files per second, which is reasonable, but total time is the issue for this job. There are many reasons I don't want to use other techniques, so please don't suggest other options unless they are closely aligned with this process.
What I was considering, to improve speed, is to run something like 10 processes concurrently. Each process would be passed a different argument (0-9). Each process would go to the directory and find all files as it currently does, but for each file found it would hash or kludge the filename down to a single digit (0-9), and if that digit matched the supplied argument, the process would handle that file and write it out to its own output file.
Then I would have 10 output files at the end. I doubt the SQL Server side could be improved, as I would have to load into separate tables and then merge them in the database, and since these are BLOB objects that will not be fast.
So I am looking for some basic code, or clues about which Perl functions to use, to take a variable (the file name, $File) and generate a single 0-9 value from it. It could probably be done by taking the ASCII value of each character, adding them together to get a long number, then adding those digits together, and so on until a single digit is left.
Any clues or suggested techniques?
Here's an easy one to implement, suggested in the documentation for Perl's unpack function:
sub string_to_code {
    # convert an arbitrary string to a digit from 0-9
    my ($string) = @_;
    return unpack("%32W*", $string) % 10;
}
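For example (assuming $File holds the file name, as in the question):

    my $bucket = string_to_code($File);   # always the same 0-9 digit for a given name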
I would like to force-write log data to a CSV file. It may well be that another user is reading that file at the moment.
What would allow me to ignore this kind of locking and write to the file anyway? Of course, the appended data should be visible once the reading user closes and reopens the file.
Possibly useful information:
Writes would occur many times, each lasting only a very short time.
I'm working on a MATLAB program in which I read a text file using textscan and then store the data in corresponding arrays. It does that every run, and it takes about two minutes each time. I would like to know if there is a way to save the arrays after I loaded the data, and have the program remember them so I can read the data only once and save time while running. I looked into the load function, but I'm not sure if that's what I need.
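One way to do that, sketched below (the file name, format string, and variable names are assumptions), is to parse the text file once, cache the result with save, and load the cached .mat file on later runs:

    if exist('parsed_data.mat', 'file')
        load('parsed_data.mat');                 % restores the previously saved arrays
    else
        fid = fopen('input.txt');
        data = textscan(fid, '%f %f %f');        % the slow parse, done only once
        fclose(fid);
        save('parsed_data.mat', 'data');         % cache the arrays for next time
    end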
Before I dive into my question, I wanted to point out that I am doing this partially to get familiarized with node and mongo. I realize there are probably better ways to accomplish my final goal, but what I want to get out of this is a general methodology that might apply to other situations.
The goal:
I have a csv file containing 6+ million geo-ip records. Each record contains 4 fields in total and the file is roughly 180mb.
I want to process this file and insert each record into a MongoDB collection called "Blocks". Each "Block" will have the 4 fields from the csv file.
My current approach
I am using mongoose to create a "Block" model and a ReadStream to process the file line by line. The code I'm using to process the file and extract the records works and I can make it print each record to the console if I want to.
For each record in the file, it calls a function that creates a new Blocks object (using mongoose), populates the fields and saves it.
This is the code inside of the function that gets called every time a line is read and parsed. The "rec" variable contains an object representing a single record from the file.
block = new Block();
block.ipFrom = rec.startipnum;
block.ipTo = rec.endipnum;
block.location = rec.locid;
connections++;
block.save(function(err){
    if(err) throw err;
    //console.log('.');
    records_inserted++;
    if( --connections == 0 ){
        mongoose.disconnect();
        console.log( records_inserted + ' records inserted' );
    }
});
The problem
Since the file is read asynchronously, more than one line is processed at a time, and reading the file is much faster than MongoDB can write, so the whole process stalls at around 282,000 records with as many as 5k+ concurrent Mongo connections. It doesn't crash... it just sits there doing nothing and doesn't seem to recover, and the document count in the Mongo collection doesn't go up any further.
What I'm after here is a general approach to solving this problem. How would I cap the number of concurrent Mongo connections? I would like to take advantage of being able to insert multiple records at the same time, but I'm missing a way to regulate the flow.
Thank you in advance.
Not an answer to your exact situation of importing from a .csv file, but rather some notes on doing bulk inserts:
-> First of all, there are no special 'bulk' insert operations; it's all a forEach in the end.
-> If you read a big file asynchronously, the reads will be a lot faster than the writes, so you should consider changing your approach. First figure out how much your setup can handle (or just use trial and error).
---> After that, change the way you read the file: you don't need to read every line asynchronously. Learn to wait; use forEach or forEachSeries from Async.js to bring your reads down to the level MongoDB can write at, and you are good to go.
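As an illustration, here is a sketch (not the asker's actual code) that assumes the rows have already been parsed into a records array and uses eachLimit, the Async.js variant that caps how many operations run at once:

    var async = require('async');

    // Keep at most 100 save() calls in flight at any one time.
    async.eachLimit(records, 100, function (rec, done) {
        var block = new Block();          // Block is the existing mongoose model
        block.ipFrom = rec.startipnum;
        block.ipTo = rec.endipnum;
        block.location = rec.locid;
        block.save(done);                 // done(err) fires when this save finishes
    }, function (err) {
        if (err) throw err;
        console.log(records.length + ' records inserted');
        mongoose.disconnect();
    });

For a 6-million-row file you would more likely pause and resume the read stream when the in-flight count hits the limit, but the throttling idea is the same.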
I would try the command-line CSV import option from MongoDB - it should do what you are after without having to write any code.
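That option is mongoimport; a rough example (the database, collection, and file names here are made up):

    mongoimport --db geoip --collection Blocks --type csv --headerline --file blocks.csv

--headerline takes the field names from the first row of the CSV; if the file has no header row, list the field names with --fields instead.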
Can you suggest any CPAN modules for searching a large sorted file?
The file contains structured data, about 15 to 20 million lines, but I only need to find about 25,000 matching entries, so I don't want to load the whole file into a hash.
Thanks.
Perl is well-suited to doing this, without the need for an external module (from CPAN or elsewhere).
Some code:
while (<STDIN>) {
    if (/regular expression/) {
        # process each matched line here, e.g.
        print;
    }
}
You'll need to come up with your own regular expression to specify which lines you want to match in your file. Once you match, you need your own code to process each matched line.
Put the above code in a script file and run it with your file redirected to stdin.
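For example (the script name is a placeholder):

    perl search.pl < big_file.txt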
A scan over the whole file may be the fastest way. You can also try File::Sorted, which will do a binary search for a given record. Locating one record in a 15-20 million line file should take about 24 seeks, so finding all 25,000 records needs roughly 0.6 million seeks/comparisons in total, compared to the 15-20 million rows a naive scan has to examine.
Disk I/O being what it is, you may want to try the easy way first, but File::Sorted is a theoretical win.
You don't want to search the file, so do what you can to avoid it. We don't know much about your problem, but here are some tricks I've used in previous problems, all of which try to do work ahead of time:
Break up the file into a database. That could be SQLite, even.
Pre-index the file based on the data that you want to search.
Cache the results from previous searches.
Run common searches ahead of time, automatically.
All of these trade storage space for speed. Some of these I would set up as overnight jobs so they were ready for people when they came into work.
You mention that you have structured data, but don't say any more. Is each line a complete record? How often does this file change?
Sounds like you really want a database. Consider SQLite, using Perl's DBI and DBD::SQLite modules.
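A rough sketch of that idea (the file, table, and column names are assumptions, and it treats the first whitespace-separated field of each line as the key):

    use strict;
    use warnings;
    use DBI;

    # One-off load: copy the big file into an indexed SQLite table so that
    # later lookups are index seeks instead of full scans.
    my $dbh = DBI->connect("dbi:SQLite:dbname=records.db", "", "",
                           { RaiseError => 1, AutoCommit => 0 });
    $dbh->do("CREATE TABLE IF NOT EXISTS records (rec_key TEXT PRIMARY KEY, line TEXT)");

    open my $fh, '<', 'big_file.txt' or die "Cannot open big_file.txt: $!";
    my $ins = $dbh->prepare("INSERT OR REPLACE INTO records (rec_key, line) VALUES (?, ?)");
    while (my $line = <$fh>) {
        chomp $line;
        my ($key) = split /\s+/, $line;   # assumes the key is the first field
        $ins->execute($key, $line);
    }
    close $fh;
    $dbh->commit;

    # Each subsequent search is then a single indexed query.
    my $sth = $dbh->prepare("SELECT line FROM records WHERE rec_key = ?");
    $sth->execute('some_key');
    while (my ($row) = $sth->fetchrow_array) {
        print "$row\n";
    }
    $dbh->disconnect;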
When you process an input file with while ( <$filehandle> ), Perl reads the file one line at a time (one per iteration of the loop), so you don't need to worry about it clogging up your memory. Not so with a for loop, which slurps the whole file into memory. Use a regex or whatever else to find what you're looking for and put that in a variable/array/hash, or write it out to a new file.
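A bare-bones illustration of that pattern (the file name and pattern are placeholders):

    use strict;
    use warnings;

    open my $fh, '<', 'big_file.txt' or die "Cannot open big_file.txt: $!";
    my @matches;
    while (my $line = <$fh>) {                        # only one line in memory at a time
        push @matches, $line if $line =~ /pattern/;
    }
    close $fh;
    print scalar(@matches), " matching lines found\n";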