Best (quickest) way to parse and modify a file - command-line

Recently I have been working with a lot of text files (CSV) with 10-60k lines, something like this:
id1,id2
id3,id1
id81,id13
...
Most of the time, I need to extract this information in the form of an array:
id1,id2,id3,id1,id81,id13
Or, at times, an array of unique elements:
id1,id2,id3,id81
Then the result is used by my code (Java) to do something.
Most of the time I write a Java function that does the whole task: reading the file, applying the logic, and returning the list of ids.
Is there a better and quicker way to achieve this, maybe via the command line?
Update:
If I were asked to build an app that reads a file and does something with it, I would certainly write that logic in Java. In my case, though, I have to go through a lot of text files that I get from the data warehouse, extract the relevant info, and then run it through my Java-based app.
This is only for experimenting with and evaluating my app.

I copied your input into a file, test.csv:
$ cat test.csv
id1,id2
id3,id1
id81,id13
Now, with the tr utility (plus a final sed to strip the trailing comma left by the last newline), you can do:
$ tr '\n' ',' < test.csv | tr -d ' ' | sed 's/,$//'
and you have:
id1,id2,id3,id1,id81,id13
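For the unique-elements variant, one approach (a sketch; note that sort -u also reorders the ids, unlike your example) is to split on commas, de-duplicate, and re-join:
$ tr ',' '\n' < test.csv | sort -u | paste -sd, -
id1,id13,id2,id3,id81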

Unless your Java code is doing something silly, it will be in the same speed ballpark as anything else.
There's nothing magic about command-line tools that will make them faster than your code.

Related

Perl Command Understanding

I have been working on some product code to resolve an issue, but am stuck on a line of code.
Can anyone help me understand what exactly this command does?
perl -MText::CSV -lne 'BEGIN{$p = Text::CSV->new()} print join "|", $p->fields() if $p->parse($_)' /home/daily/${FULL_FILENAME} > /home/output.txt
I think it copies the file to my home location with some transformations, but I'm not sure exactly.
This is a slightly broken program that translates a comma-separated values (CSV) file to a pipe-separated values file.
The particular command-line switches are documented in perlrun. This is a "one-liner", so you can read about those switches there to see what's going on.
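Roughly speaking, those switches expand the one-liner into something like this long-form program (a sketch: -n supplies the read loop, -l chomps each input line and makes print append a newline, written out explicitly here):
BEGIN { $p = Text::CSV->new() }
while (<>) {
    chomp;                                # from -l
    print join("|", $p->fields()), "\n"   # trailing newline from -l
        if $p->parse($_);
}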
The Text::CSV module deals with CSV files; the program parses each line from the file and re-outputs it as a pipe-separated line.
But this program treats each line as a complete record. That might be fine for you, but at some point you may end up with a literal value that has a newline in it, like a,"b\nc",d. Reading line-by-line then breaks the program, since the quotes appear to be unclosed within the first physical line. Not only that, it blindly joins the parsed fields without considering whether any of them should be quoted, for example because they contain a pipe character. It may be unlikely that a pipe would be in the data, but the problem isn't its rarity; it's the consequences and cost when it does show up.
The rewrite.pl example script in the related Text::CSV_XS module could replace this one-liner; it reads the input properly and knows how to translate it correctly.
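If you want to stay with a one-liner while handling both problems, i.e. reading whole records (including embedded newlines) and quoting the output properly, something along these lines should work (a sketch, untested; the in.csv and out.psv file names are just illustrative):
perl -MText::CSV -e '$i = Text::CSV->new({ binary => 1 }); $o = Text::CSV->new({ binary => 1, sep_char => "|", eol => "\n" }); while (my $r = $i->getline(\*STDIN)) { $o->print(\*STDOUT, $r) }' < in.csv > out.psv
Here getline() reads one full CSV record at a time rather than one physical line, and print() re-quotes fields as needed for the pipe separator.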

Cleaning up text files with sed?

I have a bunch of text files that need cleaning up. Example:
`E..4B?#.#...
..9J5.....P0.z.n9.9.. ........
.k#a..5
E...y^#.r...J5..
E...y_#.r...J5..
..9.P..n9..0.z............
….2..3..9…n7…..#.yr`
Is there any way sed can do this? Like notice weird patterns?
For this answer, I will assume that you have access to standard unix/linux tools.
Your file might be in some word-processor format. If so, the best way to get rid of the junk is to open it with that program. You may be able to find out which program with file:
$ file mysteryfile
mysteryfile: Composite Document File V2 Document, Little Endian, Os: Windows, Version 6.1 ....
If that doesn't work, there is a standard unix utility for extracting text from binary files. It is called strings:
$ strings mysteryfile
Some
Recovered Text
...
The behavior of strings can be fine-tuned with several options; see man strings.
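For example, raising the minimum string length (the default is 4) tends to filter out short accidental matches in the binary junk:
$ strings -n 8 mysteryfile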

SAS - Reading multiple compressed data files

I hope you are all well.
So my question is about the procedure to open multiple raw data files that are compressed.
My file names are ordered, so I have, for example: o_equities_20080528.tas.zip, o_equities_20080529.tas.zip, o_equities_20080530.tas.zip, ...
Thank you all in advance.
How much work this will be depends on whether:
- you have enough space to extract all the files simultaneously into one folder, and
- you need to be able to keep track of which file each record came from (i.e., you can't tell just by looking at a particular record).
If you have enough space to extract everything and you don't need to track which records came from which file, then the simplest option is to use a wildcard infile statement, allowing you to import the records from all of your files in one data step:
infile "c:\yourdir\o_equities_*.tas" <other infile options as per individual files>;
This syntax works regardless of OS - it's a SAS feature, not shell expansion.
If you have enough space to extract everything in advance but you need to keep track of which records came from each file, then please refer to this page for an example of how to do this using the filevar option on the infile statement:
http://www.ats.ucla.edu/stat/sas/faq/multi_file_read.htm
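In case that link moves, the pattern looks roughly like this (a sketch; the paths, variable names and input statement are placeholders for your actual layout):
data all;
    length path $ 200;
    input path $;                        /* read one file name per in-stream line */
    infile dummy filevar=path end=done;  /* 'dummy' is an arbitrary placeholder fileref */
    do while(not done);
        input id1 $ id2 $;               /* replace with your real input statement */
        output;                          /* path stays on each record, tracking its source */
    end;
datalines;
c:\yourdir\o_equities_20080528.tas
c:\yourdir\o_equities_20080529.tas
;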
If you don't have enough space to extract everything in advance, but you have access to 7-zip or another archive utility, and you don't need to keep track of which records came from each file, you can use a pipe filename and extract to standard output. If you're on a Linux platform then this is very simple, as you can take advantage of shell expansion:
filename cmd pipe "nice -n 19 gunzip -c /yourdir/o_equities_*.tas.zip";
infile cmd <other infile options as per individual files>;
On Windows it's the same sort of idea, but as you can't use shell expansion, you have to construct a separate filename statement for each zip file, or use some of 7-zip's more arcane command-line options, e.g.:
filename cmd pipe "7z.exe e -an -ai!C:\yourdir\o_equities_*.tas.zip -so -y";
This will extract all files from all of the matching archives to standard output. You can narrow this down further via the 7-zip command if necessary. You will have multiple header lines mixed in with the data - you can use findstr to filter these out in the pipe before SAS sees them, or you can just choose to tolerate the odd error message here and there.
Here, the -an tells 7-zip not to read the zip file name from the command line, and the -ai tells it to expand the wildcard.
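For example, if the header line in each file begins with a known literal (the date, prefix below is purely hypothetical), findstr can drop those lines inside the pipe:
filename cmd pipe "7z.exe e -an -ai!C:\yourdir\o_equities_*.tas.zip -so -y | findstr /v /b date,";
Here /v keeps only lines that do not match, and /b anchors the match to the beginning of the line.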
If you need to keep track of what came from where and you can't extract everything at once, your best bet (as far as I know) is to write a macro that processes one file at a time using the above techniques, adding the source information as you import each dataset.

How can I open multiple attachments of the same name in an email, then move the sender of the attachment to a spreadsheet?

I have an internship and was recently assigned the tedious task of cleaning up the email lists. My employer has sent me a series of emails with bounce messages as attachments, many at a time, all with the same name. I have been considering ways of doing this efficiently; I'm looking to avoid just clicking through like a slave. My thought was to create a macro in AutoHotkey's language, but I feel like a batch file or some Perl might do the same thing. Could anybody give me an idea of how to do this, specifically with a batch file? Thanks in advance!
Mail::DeliveryStatus::BounceParser parses bouncing email addresses out of delivery report messages.
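Its usage is roughly as follows; this is a sketch (untested) assuming each bounce message has been saved to its own file:
perl -MMail::DeliveryStatus::BounceParser -0777 -ne 'my $b = eval { Mail::DeliveryStatus::BounceParser->new($_) }; print map { "$_\n" } $b->addresses if $b' bounce1.eml bounce2.eml
The -0777 switch makes perl slurp each file whole, and addresses() returns the addresses the report says are bouncing.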
If you don't know any Perl, then I recommend that you first convert the mailbox into some format that stores each email in a separate text file, like MH or similar.
At that point, you can trivially use the command grep _pattern_ * | sed -e 's/:.*//' | sort | uniq > _list_ to obtain a list of all files matching _pattern_. You may inspect/edit this file _list_ to verify that the desired results were obtained.
You may then create another directory, junk or whatever, and move all the files listed in _list_ into it with a command like perl -ne 'chomp; rename($_, "junk/$_");' < _list_.
If you'll need this regularly, you could automate it further, probably using Perl alone, but a one-off task will likely involve more messing about to get the right message list.
Alternatively, you could load all the emails into a single folder in a sane mail reader, like Mac OS X's Mail.app, and simply do search, select-all, and move/delete commands.

Getting output on the same line of a file in DOS?

If I have output from two sources that I want to put together on the same line, how would I do that?
In my case I have a file and a program. The file is something like this:
listOfThings=
My program outputs a list of strings on a single line. I want to have a small script that runs nightly to put these two things together on a single line, but I can't figure out how to do it right.
Example batch file:
type header.txt > outputfile.txt
myProgram >> outputfile.txt
which results in this:
listOfThings=
foo bar baz etc
I really need the output file to have the list immediately follow the =, but I can't figure out how to do it with the >> operator. (And before anyone suggests it, I can't do something like put a \ at the end of the listOfThings= line; that won't work for what I'm trying to do.)
You need to make sure that the contents of header.txt do not have a carriage-return/line-feed pair in them. Look at it with a hex editor and make sure there is no 0x0D 0x0A.
Have you made sure that header.txt doesn't have any line separators in it at all? (I.e., the = is the very last byte of the file.)
Also, try copying header.txt to outputfile.txt in case type is appending a line feed on its own.
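Alternatively, if you can generate the header from the script instead of keeping it in header.txt, the old set /p trick writes text without a trailing newline (a sketch):
<nul set /p "dummy=listOfThings=" > outputfile.txt
myProgram >> outputfile.txt
set /p prints its prompt without a newline, and <nul ends the read immediately, so myProgram's output lands on the same line.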