Processing text inside variable before writing it into file - perl

I'm using the Perl WWW::Mechanize package to fetch and process data from some websites. My usual procedure is as follows:
Fetch a webpage
$mech->get($url);
Save the webpage contents in a variable (BTW, I'm not sure whether it's right to store this much text in a scalar, which, as far as I know, is supposed to hold a single value)
my $list = $mech->content();
Use a subroutine that I've created to write the contents of the variable to a text file. (The writeToFile subroutine includes a few more features, such as path and existing-file validation.)
writeToFile("$filename.tmp", $path, $list);
Process the text in the file created in the previous step by creating an additional file and saving the processed content there (then delete the initial temporary file).
What I wonder is whether it is possible to do the processing directly on the $list variable, before storing the text in a file. The whole process works as expected, but I don't really like the logic behind it, and it seems a bit inefficient as well, since I have to rewrite the same file multiple times.
EDIT:
Just to give a bit more information about what I'm actually after when I process the variable contents. So the data I fetch from the website in this case is actually a list of items separated by a blank line and the first line is irrelevant to me. So what I'm doing while processing this data is 2 things:
Remove the empty (CRLF) lines
Remove the first line if it includes a particular text.
Ideally I want to save the processed list (blank lines and first line removed) in a file without creating any additional files along the way. To save the file I would like to use the writeToFile sub I wrote, since it also checks whether such a file already exists (if a file is saved before the final processing, writeToFile will always overwrite the existing file).
Hope it makes sense.

You're looking for split. The pattern depends on what you need: use (?<=\n) to split after each newline character and keep it on the line. If keeping the line ending doesn't matter, use \R, which matches all sorts of line breaks.
foreach my $line (split qr/\R/, $mech->content) {
…
}
Now the obligatory HTML-parsing-with-regex admonishment: if you get HTML source with Mechanize, parsing it line-by-line does not make much sense. You probably want to process the HTML-stripped text version of the document instead, or pass the HTML source to a parser such as Web::Query to declaratively get at the pieces you need.
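For the two clean-up steps described in the question, a minimal sketch of doing everything in memory might look like the following. The helper name clean_list, the sample data and the header text "Item list" are made up for illustration; only writeToFile and $mech->content come from the question, and the commented-out call assumes writeToFile keeps the argument order shown there.

```perl
use strict;
use warnings;

# Hypothetical helper: drop the first line if it matches $header_re,
# then drop every empty or whitespace-only line.
sub clean_list {
    my ($text, $header_re) = @_;
    my @lines = split /\R/, $text;    # \R needs Perl 5.10 or later
    shift @lines if @lines && $lines[0] =~ $header_re;
    return join '', map { "$_\n" } grep { /\S/ } @lines;
}

# Sample input standing in for $mech->content (made-up data).
my $sample = "Item list\r\n\r\napple\r\n\r\nbanana\r\n";
print clean_list($sample, qr/Item list/);

# In the original script this would become a single write, e.g.:
# writeToFile($filename, $path, clean_list($mech->content, qr/Item list/));
```

Because the text is cleaned before it reaches writeToFile, no temporary file is needed and the existing-file check runs only once, on the final name.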

Related

Message passing between two perl files

I have 2 Perl files which cannot be merged and have to be run separately. My first file does certain initialization of parameters which are used by my second file, which performs some testing. Now I want to use the parameters initialized in the first file in the second file so how can I do that?
I will write a Perl script for Software testing. I need to write two files one is initialization file which will do all the initialization and the second file contains the test sequence to execute which will use initialize parameters. I need to run both files separately. Execution-wise my first file will execute first and then my second file will run.
I am thinking of using an XML file, where the first file logs the parameters and the second file reads them back. Is there a better way to do this?
If your initialization produces only plain key-value pairs, then any way of serialising data will suffice. Otherwise XML is probably the worst option for your case. You might need to put in a lot of effort to get the same data structure back in your second script. This happens because, by default, XML modules do not know what should be an attribute, a child node or an array of nodes. For example, a one-element array of hashes written to XML by the first script might come back as just a single hash in your second script. The results will depend heavily on the XML modules, the options you pass to them and the data itself.
JSON shouldn't have such issues. It might perform unnecessary type conversions, but you shouldn't really notice them.
Storable guarantees that you get the same data in your second script.
You might find Data::Dumper to be an easier solution, but it has some security issues, since you need to execute its output in your second script.
None of the above is meant to be used with data containing self-references or anything other than scalars, arrayrefs and hashrefs.
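As an illustration of the JSON option, here is a minimal sketch using the core JSON::PP module. The helper names save_params and load_params, the parameter names and the temporary file are assumptions made for this example; in practice the first sub would live in the initialization script and the second in the test script, with both agreeing on a file path.

```perl
use strict;
use warnings;
use JSON::PP;
use File::Temp qw(tempfile);

# Hypothetical helper for the initialization script: write parameters as JSON.
sub save_params {
    my ($file, $params) = @_;
    open my $fh, '>:raw', $file or die "Cannot write $file: $!";
    print {$fh} JSON::PP->new->utf8->canonical->pretty->encode($params);
    close $fh or die "Cannot close $file: $!";
}

# Hypothetical helper for the test script: read the parameters back.
sub load_params {
    my ($file) = @_;
    open my $fh, '<:raw', $file or die "Cannot read $file: $!";
    local $/;                         # slurp the whole file
    my $json = <$fh>;
    close $fh;
    return JSON::PP->new->utf8->decode($json);
}

# Demo with made-up parameters; a real setup would use an agreed-upon path.
my (undef, $file) = tempfile(UNLINK => 1);
save_params($file, { host => 'localhost', retries => 3, ports => [8080] });
my $params = load_params($file);
print "$params->{host} $params->{ports}[0]\n";
```

Note that the one-element array in ports comes back as an array, which is the case the answer warns about for XML.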

Talend - Extract FileName from tLogRow/tSort

I am new to Talend and just trying to work my way through it.
Problem Statement
I need to process a positional file, from a list of files. Need to identify the latest file first and then process only that file. I was able to identify the most updated file. And then I was able to create another flow which processes the positional file. The problem is combining these two flows so that I am able to identify the most recent file and have just that one processed.
Tried so far
I have been trying to extract the most recent file from a list within a directory. I iterated through all the files and retained their properties in a buffer. After this sub-task completed, I read through the buffer, sorted by descending mtime, extracted the top record and was able to print it using tLogRow.
All seems to be fine except I don't know how to use the filename now for next task.
I am certain this is very rudimentary, but I'll be honest: I've been scouring the internet/help for quite some time now, with no success.
Any pointers would help.
The job flow is attached for your reference.
First of all, you can simplify your job by using tFileList's capabilities. It can sort files by their modified date:
Next, use tIterateToFlow to convert each iteration to a row:
(String)globalMap.get("tFileList_1_CURRENT_FILEPATH")
and tSampleRow with a range of "1", to get the most recent file.
Then store the result in a global variable. In the next subjob, just use that global variable as your filename in tFileInputPositional.

Store row numbers which are causing "error"

I have to retrieve certain information from URLs. For this I have to enter text into fields of the URL, and I am using a GET operation to do so. I have to modify the text to replace spaces with "%20". Sometimes the text (which is taken from the database) is badly formed. I would like to know the row numbers so I can manually change the text for such rows in the database and run it again. I have tried to use the logs and errors section, but with little luck. Does anybody have an idea of how to do this?
First shot: Output bad urls on the console
So far, I came up with the following job design for your problem:
The trick is to catch the exceptions of the tHttpRequest component and print the necessary details on the console. For this example, I included the line number, the exception message and the URL that produced the exception.
Output (I couldn't reproduce your "Illegal character error", so I took a different one):
Second shot: Output to a file
If you really need to output the line numbers to a file, things get a little more complicated.
Instead of printing the info straight onto the console, we collect all line numbers into a context variable of type (Java) List inside the tJavaFlex. After the usual URL processing (which I have left out from the job design to keep the example small), we iterate over the Java List
and save it into a tHashOutput, so that we can finally write to a file.
We cannot write directly to the file in the tLoop section, since the Iterate flow would lead to the situation that the tFileInputDelimited would be opened several times. If "Append" was disabled, only the last bad URL line number would finally appear in the output file. If "Append" was enabled, you would get the full list of line numbers after the very first job run, but you would append every time you run the job, making the list longer and longer. Workarounds would be to use a runtime-dependent file name (e.g. a timestamp) or to delete the file at the beginning of the job run. I chose the third option, which overwrites the file every time we run the job. Feel free to choose among those options the one that suits your use case best.
Details
The tHashOutput/tHashInput components are not visible by default and must be enabled before they show up: https://www.talendforge.org/forum/viewtopic.php?pid=107249#p107249
Context variable:
INIT:
tJavaFlex "catch errors", end code:
tLoop:
tFixedFlowInput "badURL":
tHashOutput:
Needs to have "Append" enabled.

Changing value of a variable in perl using another script

I have an unusual requirement. I have a big config/Perl file in which I would like to change the value of one variable before my run. To avoid manually finding the variable and changing its value, I would like to write a Perl script to change the name of the variable. Is it possible to do this without parsing every single line of the big Perl file, creating a temporary copy and overwriting the old file?
Something is parsing this file at some point, right? Give it a list of things to substitute and you can have it only do the substitutions when it needs it. This avoids a big pre-startup overhead and if the config file is sparsely used, will result in a faster overall run.
So just make the thing reading it look for certain patterns to substitute, take the values it should use from a file (or from the command line, or environment variables, or...), and go from there.
If you don't have control over the parser, then there's not much to do. You could one-time pre-process the config file to determine EXACTLY where the substitutions need to be and write a faster processor, since it won't have to do any string parsing for regular expressions, just moving a bunch of bytes as fast as your computer can move them to the new file with the substitutions in place.
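If you do control the code that reads the config, one way to picture the "substitute at read time" idea is a small override step applied after the defaults are loaded. This is only a sketch: the sub apply_overrides, the MYAPP_ prefix and the sample keys are invented for illustration, and %defaults stands in for whatever the big config file actually provides.

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy of $config in which any key that has a
# matching PREFIX + UPPERCASE_KEY entry in $env is replaced by that value.
sub apply_overrides {
    my ($config, $env, $prefix) = @_;
    my %merged = %$config;
    for my $key (keys %merged) {
        my $env_key = $prefix . uc $key;
        $merged{$key} = $env->{$env_key} if exists $env->{$env_key};
    }
    return \%merged;
}

# Made-up defaults standing in for values loaded from the config file.
my %defaults = (log_level => 'info', workers => 4);
my $config   = apply_overrides(\%defaults, \%ENV, 'MYAPP_');
print "$_=$config->{$_}\n" for sort keys %$config;
```

With this in place, a run such as MYAPP_WORKERS=8 perl script.pl would change one value without touching the config file itself.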

zip recursively each file in a dir, where the name of the file has spaces in it

I am quite stuck; I need to compress the content of a folder, where I have multiple files (extension .dat). I went for shell scripting.
So far I told myself that is not that hard: I just need to recursively read the content of the dir, get the name of the file and zip it, using the name of the file itself.
This is what I wrote:
for i in *.dat; do zip $i".zip" $i; done
Now when I try it I get weird behavior: each file has a name like "12/23/2012 data102 test1.dat", and when I run this sequence of commands I see that zip, instead of recognizing the whole file name, sees each part of the string as a separate entity, causing the whole operation to fail.
I told myself that I was doing something wrong and that the i variable was wrong, so I substituted echo for the zip command (to see what the i variable contained); the $i output is the full name of the file, not part of it.
I am totally clueless at this point about what is going on...if the variable i is read by zip it reads each single piece of the string, instead of the whole thing, while if I use echo to see the content of that variable it gets the correct output.
Do I have to pass the value of the filename to zip in a different way? Since it is the content of a variable passed as parameter I was assuming that it won't matter if the string is one or has spaces in it, and I can't find in the man page the answer (if there is any in there).
Does anyone know why I get this behavior and how to fix it? Thanks!
You need to quote anything with spaces in it.
zip "$i.zip" "$i"
Generally speaking, any variable interpolation should have double quotes unless you specifically require the shell to split it into multiple tokens. The internal field separator $IFS defaults to space, tab and newline, but you can change it to make the shell do word splitting on arbitrary separators. See any decent beginners' shell tutorial for a detailed account of the shell's quoting mechanisms.