Store row numbers which are causing "error" - talend

I have to retrieve certain information from URLs. To do this I have to enter text into fields of the URL, and I am using a GET operation for it. I have to modify the text to replace spaces with "%20". Sometimes the text (which is taken from the database) is badly formed. I would like to know the row numbers so I can manually fix the text for those rows in the database and run the job again. I have tried to use the logs and errors section, but with little luck. Does anybody have an idea of how to do this?

First shot: Output bad urls on the console
So far, I came up with the following job design for your problem:
The trick is to catch the exceptions of the tHttpRequest component and print the necessary details on the console. For this example, I included the line number, the exception message and the URL that produced the exception.
Output (I couldn't reproduce your "Illegal character error", so I took a different one):
Second shot: Output to a file
If you really need to output the line numbers to a file, things get a little more complicated.
Instead of printing the info straight to the console, we collect all line numbers in a context variable of type (Java) List inside the tJavaFlex. After the usual URL processing (which I have left out of the job design to keep the example small), we iterate over the Java List and save it into a tHashOutput, so that we can finally write it to a file.
We cannot write to the file directly in the tLoop section, since the Iterate flow would cause the tFileOutputDelimited to be opened several times. If "Append" was disabled, only the last bad URL line number would end up in the output file. If "Append" was enabled, you would get the full list of line numbers after the very first job run, but you would append again on every run, making the list longer and longer. Workarounds would be to use a runtime-dependent file name (e.g. a timestamp) or to delete the file at the beginning of the job run. I chose a third option, the tHashOutput approach described above, which overwrites the file every time we run the job. Feel free to choose whichever of these options suits your use case best.
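For readers who think more easily in plain code than in job designs, here is a minimal Python sketch of the same collect-then-write-once idea. It is not Talend code and not what the job above generates. The function names, the `example.com` base URL and the `fetch` callable are all placeholders made up for this illustration.

```python
import csv
from urllib.parse import quote


def find_bad_rows(rows, fetch):
    """Return (row_number, url, error_message) for every row whose request fails.

    `rows` is an iterable of raw text values (e.g. read from the database).
    `fetch` is any callable that raises an exception for a bad URL.
    Both are hypothetical stand-ins for this sketch.
    """
    bad = []
    for row_number, text in enumerate(rows, start=1):
        # quote() percent-encodes the text; a space becomes %20.
        url = "http://example.com/search?q=" + quote(text)
        try:
            fetch(url)
        except Exception as exc:
            # Do not stop the run, just remember which row failed and why.
            bad.append((row_number, url, str(exc)))
    return bad


def write_bad_rows(bad, path):
    """Write all collected rows in one go, overwriting any earlier file."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter=";")
        writer.writerow(["row", "url", "error"])
        writer.writerows(bad)
```

Because the file is opened exactly once, after the loop has finished, the "only the last line survives" and "the list grows on every run" problems described above do not arise.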
Details
The tHashOutput/tHashInput components are not visible by default; they must be enabled first to show up: https://www.talendforge.org/forum/viewtopic.php?pid=107249#p107249
Context variable:
INIT:
tJavaFlex "catch errors", end code:
tLoop:
tFixedFlowInput "badURL":
tHashOutput:
Needs to have "Append" enabled.

Dataprep import dataset does not detect headers in first row automatically

I am importing a dataset from Google Cloud Storage (parameterized) into Dataprep. So far this has worked perfectly fine, and one of the features I liked is that it auto-detects that the first row of my (application/octet-stream) .csv file contains the headers.
However, today I tried to import a new dataset and it did not detect the headers; instead it auto-assigned column1, column2...
What has changed, or why is this the case? I have checked the auto-detect box and use UTF-8:
While the auto-detect option is usually pretty good, there are times that it fails for numerous reasons. I've specifically noticed this when the field names contain certain characters (e.g. comma, invisible characters like zero-width-non-joiners, null bytes), or when multiple different styles of newline delimiters are used within the same file.
Another case where I have seen this is when there were more columns of data than there were headers.
As you already hit on, you can use the following snippet to do mostly the same thing:
rename type: header method: filter sanitize: true
. . . or make separate recipe steps to convert the first row to header and then bulk-rename to your own liking.
More often than not, however, I've found that when auto-detect fails on a previously working file, it is a sign of some sort of issue with the source file. I would look for mismatched data and misplaced commas in the output, and compare the header and a few data rows against the original source in a plaintext editor.
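If eyeballing the file in an editor gets tedious, the checks mentioned in this answer (mixed newline styles, invisible characters in the header, rows with more fields than the header) can be roughly automated. The following Python sketch is my own illustration, not part of Dataprep. The function name and the set of "suspect" characters are assumptions, and it will not catch every cause of a failed auto-detect.

```python
import csv
import io

# Characters that are easy to miss in an editor (an illustrative, incomplete set).
SUSPECT_CHARS = {"\u200b", "\u200c", "\u200d", "\ufeff", "\x00"}


def inspect_csv_text(text, delimiter=","):
    """Return a list of human-readable problems found in CSV text."""
    problems = []
    if not text:
        return problems

    # Count each newline style separately to spot files that mix them.
    crlf = text.count("\r\n")
    lone_lf = text.count("\n") - crlf
    lone_cr = text.count("\r") - crlf
    if sum(1 for count in (crlf, lone_lf, lone_cr) if count) > 1:
        problems.append("mixed newline styles")

    # Look for invisible characters in the first (header) line.
    header_line = text.splitlines()[0]
    if any(ch in SUSPECT_CHARS for ch in header_line):
        problems.append("invisible character in header line")

    # Strip null bytes before parsing; older Python versions reject them.
    cleaned = text.replace("\x00", "")
    rows = list(csv.reader(io.StringIO(cleaned, newline=""), delimiter=delimiter))
    if not rows:
        return problems

    header = rows[0]
    for record_no, row in enumerate(rows[1:], start=2):
        if len(row) > len(header):
            problems.append(
                f"record {record_no} has {len(row)} fields but the header has {len(header)}"
            )
    return problems
```

Reading the file with `open(path, newline="", encoding="utf-8")` keeps the original line endings intact, so the newline check sees what is really in the file.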
When all else fails, you can try a CSV validator, but in my experience they tend to be incredibly opinionated about the formatting options of the file, so depending on the system generating the CSV, a validator could either miss errors or give false positives. I have also had two experiences where auto-detect failed for no apparent reason on perfectly clean files, so it is possible that the detection step was simply skipped for some reason.
It should also be noted that if you have a structured file that was correctly detected but want to revert it, you can go to the dataset details, select the "..." (More) button, and choose "Remove structure..." (I'm hoping that one day they'll let you do the opposite when you want to add structure to a raw dataset or work around bugs like this!)
Best of luck!
Can be resolved as a transformation within a Flow:
rename type: header method: filter sanitize: true

Talend - Extract FileName from tLogRow/tSort

I am new to Talend and just trying to work my way through it.
Problem Statement
I need to process a positional file, from a list of files. Need to identify the latest file first and then process only that file. I was able to identify the most updated file. And then I was able to create another flow which processes the positional file. The problem is combining these two flows so that I am able to identify the most recent file and have just that one processed.
Tried so far
I have been trying to extract the most recent file from a list within a directory. I iterated through all the files and retained their properties in a buffer. After this sub-task completes, I read through the buffer, sort by descending mtime, extract the top record, and am able to print it using tLogRow.
All seems to be fine, except that I don't know how to use the filename for the next task.
I am certain this is very rudimentary, but I'll be honest: I've been scouring the internet and the help for quite some time now, with no success.
Any pointers would help.
The job flow is attached for your reference.
First of all, you can simplify your job by using tFileList's capabilities. It can sort files by their modified date:
Next, use tIterateToFlow to convert each iteration to a row:
(String)globalMap.get("tFileList_1_CURRENT_FILEPATH")
and tSampleRow with a range of "1", to get the most recent file.
Then store the result in a global variable. In the next subjob, just use that global variable as your filename in tFileInputPositional.
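As a plain-code illustration of what this chain of components does (list the files, order them by modification time, keep only the newest, hand its path to the next step), here is a small Python sketch. It is not Talend code, and the function name and the `pattern` parameter are made up for the example.

```python
from pathlib import Path


def most_recent_file(directory, pattern="*"):
    """Return the most recently modified file matching `pattern`, or None."""
    files = [p for p in Path(directory).glob(pattern) if p.is_file()]
    if not files:
        return None
    # max() over the modification time plays the role of "sort, then take row 1".
    return max(files, key=lambda p: p.stat().st_mtime)
```

The returned path is the equivalent of the global variable in the job: the next step simply receives it as its file name.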

ORSSerialPort data that should be a single line is coming in over multiple lines

I have an Arduino that spits out a single line of GPS data down the serial line every half second, which I know works because I can look at the serial monitor in the Arduino IDE and, every half second, a new single line of data appears.
I'm now in the process of writing a Mac program in Swift that puts each coordinate on a map as it comes in through the serial port, and I am using the ORSSerialPort library to connect to the Arduino and receive its data. This works fine and I had a basic version working earlier; however, I noticed that there were gaps in the GPS data (the points were appearing in small groups on the map, with a noticeable space in between, when they should form a continuous line).
Before I had the map, I had a text field to which each GPS data line was added as it came in. This produced exactly the same output as the Arduino IDE serial monitor, so I thought everything was working fine.
To try to fix the problem with the map, I removed the map code and simply print()ed each line to the Xcode console as it came in through the serial port. To my surprise there were random line breaks in the data, and I don't understand why. I suspect this is causing the problems I am having (I split the string at every comma to extract the individual values), so I would like to know why the data comes out as a single line in the Arduino IDE and the text field, but not in Xcode and presumably wherever else I work with the string.
EDIT: I prefixed both the print to Xcode and the output to the text field with five plusses and suffixed them with five dashes, then set the serial port to close after sending a single report (what should be a single line of data). The output in both places ended up being three lines, each prefixed and suffixed with the plusses and dashes. See the photo below, which shows what should be a single line:
Why are my single lines of data coming through over multiple lines and behaving like individual variables (as in, getting the last character of the line returns the last character of the first of the three lines, not a semicolon)?
The issue isn't likely that there are extra newlines being inserted. Rather, ORSSerialPort (like the underlying POSIX API it uses) simply reports data to its delegate as it comes in. It has no way of knowing that for your particular use case you only want complete lines.
You need to buffer the incoming data and only process it when you've received a complete "line"/packet. ORSSerialPort includes an API, ORSSerialPacketDescriptor, that makes this easier. There is further documentation for that API here: https://github.com/armadsen/ORSSerialPort/wiki/Packet-Parsing-API
Do note that this API doesn't (yet) support using only an end delimiter. You need to validate the entire packet from beginning to end, as the parsing routine is "lazy": it tries to find the smallest possible match, starting from the end of the packet.
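To make the buffering idea concrete independently of ORSSerialPort, here is a minimal, library-agnostic sketch in Python. It is not the packet descriptor API mentioned above. The class name is made up, and the default terminator of `\r\n` is an assumption about what the sending side emits.

```python
class LineBuffer:
    """Accumulate arbitrary chunks and hand back only complete lines."""

    def __init__(self, terminator=b"\r\n"):
        self.terminator = terminator
        self._pending = b""

    def feed(self, chunk):
        """Add a chunk of bytes; return the lines that are now complete."""
        self._pending += chunk
        # Everything before the last terminator is complete; the tail is kept
        # for the next call because it may be the start of another line.
        *lines, self._pending = self._pending.split(self.terminator)
        return lines
```

Each chunk delivered by the serial port would be passed to `feed()`, and only the returned lines would be split on commas and plotted. A report that arrives in three pieces then still comes out as one line.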

Processing text inside variable before writing it into file

I'm using Perl WWW::Mechanize package in order to fetch and process data from some websites. Usually my way of action is as follows:
1. Fetch a webpage:
$mech->get($url);
2. Save the webpage contents in a variable (BTW, I'm not sure whether a scalar, which as far as I know is meant to hold a single value, is the right place for this much text):
my $list = $mech->content();
3. Use a subroutine that I've created to write the contents of the variable to a text file (the writeToFile subroutine includes a few more features, such as path and existing-file validation):
writeToFile("$filename.tmp", $path, $list);
4. Process the text in the file created in the previous step by creating an additional file and saving the processed content there (then delete the initial temporary file).
What I wonder about, is whether it is possible to perform the processing before storing the text in a file, directly inside the $list variable? The whole process is working as expected but I don't really like the logic behind it and it seems a bit inefficient as well, since I have to rewrite the same file multiple times.
EDIT:
Just to give a bit more information about what I'm actually after when I process the variable contents. So the data I fetch from the website in this case is actually a list of items separated by a blank line and the first line is irrelevant to me. So what I'm doing while processing this data is 2 things:
Remove the empty (CRLF) lines
Remove the first line if it includes a particular text.
Ideally I want to save the processed list (no blank lines, and the first line removed) in a file without creating any additional files on the way. To save the file I would like to use the writeToFile sub that I wrote, since it also checks whether such a file already exists (if a file is saved before the final processing, writeToFile will always overwrite the existing file).
Hope it makes sense.
You're looking for split. The pattern depends on what you need: use (?<=\n) to split at a newline character and keep it. If keeping it doesn't matter, use \R to match all sorts of line breaks.
foreach my $line (split qr/\R/, $mech->content) {
…
}
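The general shape of "process the text in memory, then write once" is sketched below in Python rather than Perl, purely as an illustration of the logic. The function name and the `unwanted_header` parameter are hypothetical, and the two rules are the ones from the question's EDIT (drop empty lines, drop the first line if it contains a particular text).

```python
def clean_listing(content, unwanted_header):
    """Drop blank lines, then drop the first line if it contains `unwanted_header`."""
    # splitlines() handles \n, \r\n and \r, much like \R does in a Perl regex.
    lines = [line for line in content.splitlines() if line.strip()]
    if lines and unwanted_header in lines[0]:
        lines = lines[1:]
    return "\n".join(lines) + "\n" if lines else ""
```

The cleaned string can then be handed to the existing file-writing routine in a single call, so no temporary file is needed.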
Now the obligatory HTML-parsing-with-regex admonishment: if you get HTML source with Mechanize, parsing it line-by-line does not make much sense. You probably want to process the HTML-stripped text version of the document instead, or pass the HTML source to a parser such as Web::Query to declaratively get at the pieces you need.

How can I make log4perl output easier to read?

When using log4perl, the debug log layout that I'm using is:
log4perl.appender.D10.layout=PatternLayout
log4perl.appender.D10.layout.ConversionPattern=%d [pid=%P] %p %F{1} (%L) %M %m%n
log4perl.appender.D10.Filter = DebugAndUp
This produces very verbose debug logs, for example:
2008/11/26 11:57:28 [pid=25485] DEBUG SomeModule.pm (331) functions::SomeModule::Test Test XXX was successfull
2008/11/26 11:57:29 [pid=25485] ERROR SomeOtherUnrelatedModule.pm (99999) functions::SomeModule::AnotherTest AnotherTest YYY has faled
This works great, and provides excellent debugging data.
However, each line of the debug log contains function names, PIDs, etc. of different lengths. As a result every line is laid out differently, which makes the debug logs much harder to read than they need to be.
Is there a way in log4perl to format the line so that the debugging metadata (everything up to the actual log message) is padded at the end with spaces/tabs, so that the actual message always starts at the same column?
You can pad the single fields that make up your entries. For example [pid=%5P] will always give you at least 5 characters for the PID.
The "Quantify placeholders" section in the docs for Log::Log4perl::Layout::PatternLayout gives more details.
There are a couple of ways to go with this, although you have to figure out which one works better for your situation:
Use a different appender if you are working live. Have that appender use a pattern that shows only the information you want. If you're working in a single process, for instance, your alternate appender might leave off the PID and the timestamp. You might only need the file name and line number.
Use %n to put newlines in the right place. That makes it multi-line output that is slightly harder to parse later, but you can choose another sequence for the input record separator (say, a literal "[EOL]") to make it easy to read entry-by-entry.
Log to a database instead of a file. For your reports, select just the columns you want to inspect.
Log everything, but write a filter to go through the log file ad-hoc to display just the parts that you want to see, such as only the debugging messages, the entries between certain times, only the entries involving a file, and so on.
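A filter of the kind described in the last option could look roughly like the following Python sketch. It assumes the exact layout from the question (timestamp, `[pid=N]`, level, then the rest). The function name and its parameters are made up for the example.

```python
import re
from datetime import datetime

# Matches the layout from the question: date, time, [pid=N], level, rest.
ENTRY = re.compile(
    r"^(?P<stamp>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[pid=(?P<pid>\d+)\] (?P<level>\w+) (?P<rest>.*)$"
)


def filter_log(lines, level=None, since=None, until=None):
    """Yield the entries that match the optional level and time window."""
    for line in lines:
        entry = line.rstrip("\n")
        match = ENTRY.match(entry)
        if not match:
            continue  # skip lines that do not follow the expected layout
        stamp = datetime.strptime(match["stamp"], "%Y/%m/%d %H:%M:%S")
        if level and match["level"] != level:
            continue
        if since and stamp < since:
            continue
        if until and stamp > until:
            continue
        yield entry
```

For example, `filter_log(open("debug.log"), level="ERROR")` would yield only the ERROR entries, where `debug.log` is a placeholder file name.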