How can I make log4perl output easier to read?

When using log4perl, my debug log layout is:
log4perl.appender.D10.layout=PatternLayout
log4perl.appender.D10.layout.ConversionPattern=%d [pid=%P] %p %F{1} (%L) %M %m%n
log4perl.appender.D10.Filter = DebugAndUp
This produces very verbose debug logs, for example:
2008/11/26 11:57:28 [pid=25485] DEBUG SomeModule.pm (331) functions::SomeModule::Test Test XXX was successful
2008/11/26 11:57:29 [pid=25485] ERROR SomeOtherUnrelatedModule.pm (99999) functions::SomeModule::AnotherTest AnotherTest YYY has failed
This works great, and provides excellent debugging data.
However, each line of the debug log contains different function names, PID lengths, etc. This makes each line lay out differently and makes reading debug logs much harder than it needs to be.
Is there a way in log4perl to format the line so that the debugging metadata (everything up until the actual log message) is padded at the end with spaces/tabs, and the actual message starts at the same column of text?

You can pad the individual fields that make up your entries. For example, [pid=%5P] will always give you at least 5 characters for the PID.
The "Quantify Placeholders" section in the docs for Log::Log4perl::Layout gives more details.

There are a couple of ways to go with this, although you have to figure out which one works better for your situation:
Use a different appender if you are working live. Have that appender use a pattern that shows only the information you want. If you're working in a single process, for instance, your alternate appender might leave off the PID and the timestamp. You might only need the file name and line number.
Use %n to put newlines in the right place. That makes it multi-line output that is slightly harder to parse later, but you can choose another sequence for the input record separator (say, a literal "[EOL]") to make it easy to read entry-by-entry.
Log to a database instead of a file. For your reports, select just the columns you want to inspect.
Log everything, but write a filter to go through the log file ad hoc to display just the parts that you want to see, such as only the debugging messages, the entries between certain times, only the entries involving a particular file, and so on; a couple of one-liner sketches of this approach follow.
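For that last approach, minimal Perl one-liner sketches (debug.log is a hypothetical file name; adjust the patterns to your own layout):
# only the DEBUG entries
perl -ne 'print if / DEBUG /' debug.log
# only the entries logged between 11:50 and 11:59 on a given day
perl -ne 'print if m{^2008/11/26 11:5\d}' debug.log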


Dataprep import dataset does not detect headers in first row automatically

I am importing a dataset from Google Cloud Storage (parameterized) into Dataprep. So far, this has worked perfectly fine, and one of the features that I liked is that it auto-detects that the first row in my (application/octet-stream) .csv file contains my headers.
However, today I tried to import a new dataset and it did not detect the headers, but auto-assigned column1, column2...
What has changed, and why is this the case? I have checked the auto-detect box and I am using UTF-8.
While the auto-detect option is usually pretty good, there are times that it fails for numerous reasons. I've specifically noticed this when the field names contain certain characters (e.g. commas, invisible characters like zero-width non-joiners, or null bytes), or when multiple different styles of newline delimiters are used within the same file.
Another case where I've seen this is when there were more columns of data than there were headers.
As you already hit on, you can use the following snippet to do mostly the same thing:
rename type: header method: filter sanitize: true
...or make separate recipe steps to convert the first row to a header and then bulk-rename to your own liking.
More often than not, however, I've found that when auto-detect fails on a previously working file, it tends to be a sign of some sort of issue with the source file. I would look for mismatched data and misplaced commas in the output, and compare the header and some data rows against the original source in a plaintext editor.
When all else fails, you can try a CSV validator... but in my experience they tend to be incredibly opinionated when it comes to the formatting options of the file, so depending on the system generating the CSV, it could either miss errors or give false positives. I have had two experiences where auto-detect failed for no apparent reason on perfectly clean files, so it is possible that the process was just skipped for some reason.
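If you'd rather sanity-check the file locally first, here is a rough Perl sketch using the Text::CSV module (the file name suspect.csv is hypothetical; any strict CSV parser would do) that points at the first row that fails to parse:
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;                                   # CPAN module

my $csv = Text::CSV->new({ binary => 1 })
    or die "Cannot use Text::CSV: " . Text::CSV->error_diag();
open my $fh, '<', 'suspect.csv' or die "open: $!";
1 while $csv->getline($fh);                      # read rows until one fails to parse (or EOF)
if ($csv->eof) {
    print "No parse errors found\n";
} else {
    printf "Parse problem near line %d: %s\n", $., scalar $csv->error_diag();
}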
It should also be noted that if you have a structured file that was correctly detected but want to revert it, you can go to the dataset details, select the "..." (More) button, and choose "Remove structure..." (I'm hoping that one day they'll let you do the opposite when you want to add structure to a raw dataset or work around bugs like this!)
Best of luck!
Can be resolved as a transformation within a Flow:
rename type: header method: filter sanitize: true

Troubleshooting "no writable tags set" error

I'm trying to (ultimately) modify a batch of files but getting stuck in the basics as I try to modify a single file before running a batch command.
If someone could help me troubleshoot the command I'm inputting, that would be fantastic. I'm sure it's something very simple.
Thanks a lot for any help you can provide!
Here's the abbreviated image exif data:
-ExifToolVersion=10.10
-FileName=2018_11_13_1.jpeg
-Directory=.
-FileSize=2.8 MB
-FileModifyDate=2019:07:12 15:40:38-07:00
-FileAccessDate=2019:07:12 15:40:38-07:00
-FileInodeChangeDate=2019:07:23 10:38:02-07:00
-FilePermissions=rw-rw-r--
-FileType=JPEG
-FileTypeExtension=jpg
-MIMEType=image/jpeg
[...]
-ModifyDate=2018:11:13 12:00:53
[...]
-DateTimeOriginal=2018:11:13 12:00:53
-CreateDate=2018:11:13 12:00:53
My current input is: exiftool "-FileModifyDate<$filename00000" ./2018_11_13_1.jpeg
And the error message is:
Warning: No writable tags set from 2018_11_13_1.jpeg
0 image files updated
1 image files unchanged
And the exif data is, of course, unchanged.
I've confirmed that I can write a value to this tag, so there's definitely something going wrong in pulling from the filename.
(Continued from How to compensate for incomplete date/time info in filename)
The problem here is that you are trying to write from a tag named filename00000. If you check the example in the other post, you will see that there is a space after Filename. This sets it apart so that exiftool knows what is a tag name and what is other data.
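For comparison, with the space added your command would follow the same pattern as the one in that post, something like the following (read on before using it as-is):
exiftool "-FileModifyDate<$filename 000000" ./2018_11_13_1.jpeg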
There is possibly an additional problem here, though. Your filename has an extra number that is not the date. When exiftool tries to write the time stamp from the filename, it is going to end up with a value of "2018:11:13 10:00:00", which might become especially problematic if that last digit hits a value of 3 or more, resulting in a timestamp of "2018:11:13 30:00:00".
I would suggest using exiftool's Advanced Formatting Feature (a fancy way of saying that you can use perl code in the command) to strip the excess data. Something like
exiftool "-FileModifyDate<${filename;s/^(.*\d{4}_\d\d_\d\d).*/$1/} 000000" ./2018_11_13_1.jpeg
Take note, though: if the filenames are in any other format, it would require a different command.

ORSSerialPort data that should be a single line is coming in over multiple lines

I have an Arduino that spits out a single line of GPS data down the serial line every half second. I know this works because I can watch the serial monitor in the Arduino IDE and see a new single line of data appear every half second.
I'm now in the process of writing a Mac program in Swift that puts each coordinate on a map as it comes in through the serial port, using the ORSSerialPort library to connect to the Arduino and receive its data. This works fine and I had a basic version working earlier; however, I noticed that there were gaps in the GPS data (the points were appearing in small groups on the map, with a noticeable space in between, when it should be a continuous line of them).
Before I had the map, I had a text field that would have each GPS data line added to it as it came in, which produced the exact same output as the Arduino IDE serial monitor, so I thought everything was working fine.
To try to fix the problem with the map, I removed the map code and simply print()ed each line to the Xcode console as it came in through the serial port. To my surprise there were random line breaks in the data, and I don't understand why. I suspect this is causing the problems I'm having (with splitting the string at every comma so I can extract the individual values), so I would like to know why the data comes out as a single line in the Arduino IDE and the text field, but not in the Xcode console and presumably wherever else I am working with the string.
EDIT: I prefixed the print to the Xcode console and the output to the text field with five pluses and suffixed them with five dashes, then set the serial port to close after sending a single report (what should be a single line of data). The output in both places ended up being three lines, each prefixed and suffixed with the pluses and dashes. See the photo below, which shows what should be a single line:
Why are my single lines of data coming through over multiple lines and behaving like individual variables (as in, getting the last character of the line returns the last character of the first line of the three, not a semicolon)?
The issue likely isn't that extra newlines are being inserted. Rather, ORSSerialPort (like the underlying POSIX API it uses) simply reports data to its delegate as it comes in. It has no way of knowing that for your particular use case you only want complete lines.
You need to buffer the incoming data and only process it when you've received a complete "line"/packet. ORSSerialPort includes an API, ORSSerialPacketDescriptor, that makes this easier. There is further documentation for that API here: https://github.com/armadsen/ORSSerialPort/wiki/Packet-Parsing-API
Do note that this API doesn't (yet) support the use of an end delimiter only. You need to validate the entire packet beginning to end, as the parsing routine is "lazy". That is, it tries to find the smallest match possible starting from the end of the packet.

Store row numbers which are causing "error"

I have to retrieve certain information from URLs. For this, I have to enter text into fields of the URL. I am using a GET operation for this. I have to modify the text to replace spaces with "%20". Sometimes the text (which is taken from the database) is badly formed. I would like to know the row numbers so I can manually change the text for such rows in the database and run it again. I have tried to use the logs and errors section, but with little luck. Does anybody have an idea of how to do this?
First shot: Output bad URLs on the console
So far, I came up with the following job design for your problem:
The trick is to catch the exceptions of the tHttpRequest component and print the necessary details on the console. For this example, I included the line number, the exception message and the URL that produced the exception.
Output (I couldn't reproduce your "Illegal character error", so I took a different one):
Second shot: Output to a file
If you really need to output the line numbers to a file, things get a little more complicated.
Instead of printing the info straight onto the console, we collect all line numbers into a context variable of type (Java) List inside the tJavaFlex. After the usual URL processing (which I have left out of the job design to keep the example small), we iterate over the Java List and save it into a tHashOutput, so that we can finally write to a file.
We cannot write directly to the file in the tLoop section, since the Iterate flow would mean that the tFileOutputDelimited gets opened several times. If "Append" were disabled, only the last bad URL line number would end up in the output file; if "Append" were enabled, you would get the full list of line numbers after the very first job run, but you would also append on every subsequent run, making the list longer and longer. Workarounds would be to use a runtime-dependent file name (e.g. a timestamp) or to delete the file at the beginning of the job run. I chose a third option, which is to overwrite the file every time the job runs. Feel free to choose whichever of those options suits your use case best.
Details
The tHashOutput/tHashInput components are not visible by default and must be enabled first to show up: https://www.talendforge.org/forum/viewtopic.php?pid=107249#p107249
The settings to check are:
the context variable and its INIT value
the tJavaFlex "catch errors" end code
the tLoop
the tFixedFlowInput "badURL"
the tHashOutput, which needs to have "Append" enabled

Paginate a big text file

I have a big text file. Each line in the file is a record, and I need to parse the file and show only 20 records at a time in an HTML table. I will have to support sorting as well.
What I am currently doing is reading the file line by line based on the parameters start, stop, and page_size, which are provided in the query string. It seems to work fine until I have to sort the records, because in order to sort I need to process every line in the text file.
So is there a Unix command with which I can extract a range of lines and sort them? I tried grep, but I do not know it well enough to solve this problem.
Take a look at the pr command. This is what we used to use all the time to paginate big files. You can set the page length, headers, footers, turn on line numbers, etc.
There's probably even a way to munge the output into HTML.
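For example, a rough sketch (assuming comma-separated records in a hypothetical file records.txt, sorted on the second field before paginating; -l sets the page length, -n turns on line numbers, and -h sets the page header):
sort -t, -k2 records.txt | pr -l 25 -n -h "records"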
How big is the file?
man sort