We are generating xlsx files using a Perl script. The files usually contain thousands of records, which makes spotting errors very difficult.
This process has been working for years without problems.
This week we got a request to check a file that contains errors. When opening it, Excel reported that the file contains errors and asked whether we want to repair them.
In fact we do not want to recover the data; we want to know which part of the file is corrupt. The error presumably comes from corrupt data, and we would like to identify that data.
The recovery log shows the following:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<recoveryLog xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
<logFileName>error068200_01.xml</logFileName>
<summary>Errors were detected in file 'D:\Temp\20161020\file_name.xlsx'</summary>
<repairedRecords summary="Following is a list of repairs:"><repairedRecord>Repaired Records: Cell information from /xl/worksheets/sheet1.xml part</repairedRecord>
</repairedRecords>
</recoveryLog>
Is there any tool or method that helps to spot this corrupt data?
I tried renaming the file to .zip, extracting it, and opening it in an XML editor, but I was not able to find any errors in the XML files.
We also checked that the structure of the different XML parts is fine.
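One way to automate that kind of check is a sketch like the following (the class name and file path are placeholders); note that it only proves each part is well-formed XML, not that Excel accepts every value inside:

import java.io.InputStream;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;

// Treat the .xlsx as a zip archive and run every XML part through a parser.
// This only catches malformed XML, not values Excel itself refuses to load.
public class CheckXlsxParts {
    public static void main(String[] args) throws Exception {
        try (ZipFile zip = new ZipFile("file_name.xlsx")) {   // placeholder path
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (!entry.getName().endsWith(".xml")) {
                    continue;
                }
                try (InputStream in = zip.getInputStream(entry)) {
                    factory.newDocumentBuilder().parse(in);
                    System.out.println("OK        " + entry.getName());
                } catch (Exception e) {
                    System.out.println("MALFORMED " + entry.getName() + ": " + e.getMessage());
                }
            }
        }
    }
}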
Thank you and best regards
As expected, the problem came from text cells containing numbers with an 'E' in the middle. I used the following steps to identify the erroneous cells.
1. I wrote a small Java class to read the file. The class checks the cell type and then displays the value (a sketch of such a reader follows this list). The program threw an exception at some row, "Cannot get a numeric value from a text cell", even though I was checking the cell type before displaying the content.
2. I opened the Excel file at that row and found that the cell contains only 'inf'.
3. I opened the file in OpenOffice and looked at the same cells; they contain 0.
4. I debugged the program that generates the data and found that these cells contain values like '914E5514'. The 'E' was apparently interpreted by Excel as an exponent. We changed the program to use the format '#' for that cell, which solved the issue.
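Below is a minimal sketch of the kind of reader described in step 1, assuming a recent Apache POI (the library is not named above) and a placeholder file name. Picking the getter from the reported cell type means a number stored as text (or the reverse) shows up as an odd type/value pair in the output instead of an exception deep inside the loop:

import java.io.File;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.CellType;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

// Walk the first sheet and print each cell's reported type and value.
public class DumpCells {
    public static void main(String[] args) throws Exception {
        try (Workbook wb = WorkbookFactory.create(new File("file_name.xlsx"))) {  // placeholder path
            Sheet sheet = wb.getSheetAt(0);
            for (Row row : sheet) {
                for (Cell cell : row) {
                    CellType type = cell.getCellType();
                    String value;
                    switch (type) {
                        case NUMERIC: value = Double.toString(cell.getNumericCellValue());  break;
                        case STRING:  value = cell.getStringCellValue();                    break;
                        case BOOLEAN: value = Boolean.toString(cell.getBooleanCellValue()); break;
                        case FORMULA: value = cell.getCellFormula();                        break;
                        default:      value = "";                                           break;
                    }
                    System.out.println(cell.getAddress() + " [" + type + "] " + value);
                }
            }
        }
    }
}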
Thank you.
Thank you very much; you helped me a lot by pointing out that a single piece of content may be the root problem.
My corrupted content was https://www.example.com XYZ ... ASDAS
Solution: www.example.com XYZ ... ASDAS
This is something that Excel cannot handle. It would be nice to have a list of things that do not work.
I am importing a dataset from Google Cloud Storage (parameterized) into Dataprep. So far this has worked perfectly, and one of the features I liked is that it auto-detects that the first row of my (application/octet-stream) .csv file contains my headers.
However, today I tried to import a new dataset and it did not detect the headers; it auto-assigned column1, column2...
What has changed, or why is this the case? I have checked the auto-detect box and use UTF-8.
While the auto-detect option is usually pretty good, there are times when it fails for numerous reasons. I've specifically noticed this when the field names contain certain characters (e.g. commas, invisible characters like zero-width non-joiners, null bytes), or when multiple different styles of newline delimiters are used within the same file.
Another case where I've seen this is when there were more columns of data than there were headers.
As you already hit on, you can use the following snippet to do mostly the same thing:
rename type: header method: filter sanitize: true
...or make separate recipe steps to convert the first row to a header and then bulk-rename the columns to your liking.
More often than not, however, I've found that when auto-detect fails on a previously working file, it tends to be a sign of some sort of issue with the source file. I would look for mismatched data and misplaced commas in the output, and compare the header and a few data rows against the original source in a plain-text editor.
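For example, a quick way to spot mismatched rows is to count fields per line and compare against the header. A minimal sketch of that check (the file name and the plain comma split are assumptions; quoted fields containing commas will also be flagged, which is often exactly what you want to look at anyway):

import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;

// Report every line whose number of comma-separated fields differs from the header's.
public class CsvFieldCount {
    public static void main(String[] args) throws Exception {
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("data.csv"))) {  // placeholder path
            String header = reader.readLine();
            if (header == null) {
                System.out.println("Empty file");
                return;
            }
            int expected = header.split(",", -1).length;
            System.out.println("Header has " + expected + " fields");
            String line;
            int lineNo = 1;
            while ((line = reader.readLine()) != null) {
                lineNo++;
                int actual = line.split(",", -1).length;
                if (actual != expected) {
                    System.out.println("Line " + lineNo + ": " + actual + " fields -> " + line);
                }
            }
        }
    }
}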
When all else fails, you can try a CSV validator, but in my experience they tend to be very opinionated about the file's formatting options, so depending on the system generating the CSV they may either miss errors or give false positives. I have had two experiences where auto-detect failed for no apparent reason on perfectly clean files, so it is also possible that the detection step was simply skipped for some reason.
It should also be noted that if you have a structured file that was correctly detected but want to revert it, you can go to the dataset details, select the "..." (More) button, and choose "Remove structure..." (I'm hoping that one day they'll let you do the opposite when you want to add structure to a raw dataset or work around bugs like this!)
Best of luck!
This can be resolved with a transformation within a Flow:
rename type: header method: filter sanitize: true
I have an HTML file containing records in a table, and parts of it somehow got encoded the wrong way. A large portion of the file is correct and shows the content as expected. The HTML markup itself is displayed correctly (all the elements, etc.), but the values within the table cells are sometimes wrongly encoded.
For example one cell contains:
<cell>»¿è²å¼æäºæ 线æ¥å¥ç½ç»ä¸çæ³¢ææå½¢ææ¯ç 究</cell>
While it should contain:
<cell>绿色异构云无线接入网络中的波束成形技术研究</cell>
I already tried figuring out what exactly went wrong, but I can't seem to find the correct solution to completely resolve this problem for the whole file. I tried tools such as FTFY, which didn't give me any meaningful result.
These websites gave me some direction and it seems that something went wrong between Windows-1252/1251 and UTF-8. The first website seems to fix the problem but still returns some unknown characters (UTF-8 displayed as Windows-1252).
Does anyone have an idea how to fix this for the whole file, or any tips to help me figure it out on my own?
Thanks in advance.
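If the cell text is UTF-8 that was decoded as Windows-1252 somewhere along the way, re-encoding the garbled string as Windows-1252 and decoding those bytes as UTF-8 usually recovers the original. A minimal sketch of that round trip (windows-1252 is an assumption; windows-1251 or ISO-8859-1 can be tried the same way, and characters the file has already lost cannot be recovered, which may explain the leftover unknown characters):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Undo "UTF-8 bytes decoded as Windows-1252": encode the garbled text back to
// the bytes it came from, then decode those bytes as UTF-8.
public class FixMojibake {
    public static void main(String[] args) {
        Charset windows1252 = Charset.forName("windows-1252");

        // Demonstrate on a known value: garble it the same way the file was garbled...
        String original = "绿色异构云无线接入网络中的波束成形技术研究";
        String garbled = new String(original.getBytes(StandardCharsets.UTF_8), windows1252);
        System.out.println("Garbled:  " + garbled);

        // ...then reverse the damage on the garbled string.
        String repaired = new String(garbled.getBytes(windows1252), StandardCharsets.UTF_8);
        System.out.println("Repaired: " + repaired);
    }
}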
The complete output text file is hundreds of lines long, with relevant nuclear cross sections and a plethora of other data that I do not need for this particular problem. I am trying to extract the columns of data under "BURNUP" and the first "K-INF" from the file I attached and place them into a separate file. I am a newbie and have a similar Perl script from a professor. I have tried to adapt it to the information I am looking for, but the only output I get is the two print statements. Any suggestions?
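Without the attached file it is hard to say why the script only reaches its two print statements, but the general approach is to find the header line, remember the positions of the BURNUP and K-INF columns, and copy those columns from the data rows that follow. A minimal sketch of that idea (in Java rather than Perl, with placeholder file names; it assumes the titles BURNUP and K-INF appear whitespace-separated on one header line and that the data rows line up with them):

import java.io.BufferedReader;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Find the header line containing BURNUP and K-INF, note their column positions,
// then copy those two columns from the following data rows into a separate file.
public class ExtractBurnupKinf {
    public static void main(String[] args) throws Exception {
        try (BufferedReader in = Files.newBufferedReader(Paths.get("output.txt"));          // placeholder names
             PrintWriter out = new PrintWriter(Files.newBufferedWriter(Paths.get("burnup_kinf.txt")))) {
            int burnupCol = -1;
            int kinfCol = -1;
            String line;
            while ((line = in.readLine()) != null) {
                List<String> fields = Arrays.asList(line.trim().split("\\s+"));
                if (burnupCol < 0) {
                    // Still looking for the header row.
                    if (fields.contains("BURNUP") && fields.contains("K-INF")) {
                        burnupCol = fields.indexOf("BURNUP");
                        kinfCol = fields.indexOf("K-INF");   // first K-INF column only
                        out.println("BURNUP K-INF");
                    }
                } else if (fields.size() > Math.max(burnupCol, kinfCol)
                        && fields.get(burnupCol).matches("[-+0-9.Ee]+")) {
                    // A data row: keep just the two columns of interest.
                    out.println(fields.get(burnupCol) + " " + fields.get(kinfCol));
                }
            }
        }
    }
}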
I am using PowerShell to read data from a .xls file. All goes well (no need to show the code), but I have an issue that I have no idea how to solve.
Some columns in the Excel file are not wide enough to display their data:
In reality:
2016.04.01
As the column is displayed when opening the file:
########
When I read from the Excel file, the cell content I get in PowerShell is literally ########, not the actual cell content (a date in this case).
I am reading thousands of Excel files, all with the same issue. How can I fix this in my code? Is there a flag I need to set that I don't know about?
Any help would be much appreciated!
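For what it's worth, ######## is only how Excel renders a value that does not fit the column width; the underlying value (here a date) is still stored in the cell, so reading the stored value rather than the rendered display text avoids the problem. The PowerShell code is not shown above, but as a cross-check, here is a sketch that prints the stored values with a recent Apache POI (an assumption, not the approach used in the question), regardless of column width:

import java.io.File;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

// Print the stored value of every cell in the first sheet of a .xls file.
// Column width only affects how Excel draws the value on screen.
public class ReadStoredValues {
    public static void main(String[] args) throws Exception {
        try (Workbook wb = WorkbookFactory.create(new File("report.xls"))) {  // placeholder name
            DataFormatter fmt = new DataFormatter();
            Sheet sheet = wb.getSheetAt(0);
            for (Row row : sheet) {
                for (Cell cell : row) {
                    // formatCellValue applies the cell's number/date format
                    // (e.g. 2016.04.01), never the on-screen ########.
                    System.out.println(cell.getAddress() + " = " + fmt.formatCellValue(cell));
                }
            }
        }
    }
}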
I sincerely apologize if this isn't the proper forum to discuss this, but I wasn't sure where to go or what would be the best option.
Basically, I'm trying to find a database-friendly list of Veterans Affairs hospitals. The closest thing I've been able to find is www.va.gov/ofcadmin/docs/CATB.pdf, as it has all the information I'm looking for:
Region
Address
City in a separate column
Zip Code in a separate column
State
Facility # (also known as StationID)
VISN
Symbol
I've tried exporting that PDF to CSV, but it's a complete nightmare to get working. So I was curious whether anyone had any ideas or insights into how I could accomplish this.
First, here's a CSV file containing the data found in CATB.pdf. The very first line contains the column headers, and the rest of the file contains the contents.
http://tmp.alexloney.com/CATB.csv
Now, for the more detailed explanation... I took the PDF you linked to, converted it to an HTML document using Adobe Acrobat, and then used a lot of regular expressions to parse and clean it up. Once the file was clean enough, I wrote a program to parse the remainder, grab the state and region, and output it all as a nicely formatted CSV.
Hope that helps you!
I believe that PDFILL has an option that will convert a PDF file to Excel. Once in Excel, you should have no problem converting to a CSV file.