Why does the Open XML API import a Text-formatted column's cells differently for every row - openxml

I am working on an ingestion feature that will take a strongly formatted .xlsx file, import the records to a temporary storage table, and then process the rows to create db records.
One of the columns is strictly formatted as "Text", but the Open XML API seems to handle that column's cells differently on a row-by-row basis. Some of the values, while appearing to be numeric, are truly not (which is why we format the column as Text) -
some examples are "211377", "211727.01", "209395.388", "209395.435".
What these values represent is not important, but some values (using the Open XML API v2.5 library) are read in properly as text, whether retrieved from the Shared Strings collection or simply from the InnerXml property, while others get pulled in as numbers with what appears to be extra precision appended.
For example, "211377", "211727.01" and "209395.435" all come in exactly as they are in the spreadsheet, but the "209395.388" value is being pulled in as "209395.38800000001" (there are others this happens to as well).
There seems to be no rhyme or reason to which values get messed up and which ones import fine. What is really frustrating is that if I use the native Import feature in SQL Server Management Studio and ingest the same spreadsheet to a temp table, this does not happen - so how is it that the SSMS import can handle these values as purely text for all rows but the Open XML API cannot?

To begin the answer, your main problem seems to be this value:
"209395.388" value is being pulled in as "209395.38800000001"
Yes, in the .xlsx file the value is stored as 209395.38800000001 instead of 209395.388, and that is a correct way to store floating-point numbers; there is nothing wrong with it. You can simply confirm it with the following code snippet:
string val = "209395.38800000001"; // <= What we extract from Open Xml
Console.WriteLine(double.Parse(val)); // <= Simply pass it to double and print
The output is:
209395.388 // <= yes the expected value
So there's nothing wrong with the value you extract from the .xlsx using the Open XML SDK.
Now to cells: yes, a cell can have a variety of formats - numbers, text, booleans, or shared-string text. And you can apply styles to a cell which will format your value to a desired output in Excel (e.g. date/time formats, forced strings, etc.). This is the way Excel handles its vast variety of data; it needs this kind of formatting, and the .xlsx file format had to be a little complex to support it all.
My advice is to use a proper parse method on the extracted values: identify what format each value represents (for example, whether it is a number or text) and apply the appropriate type of parse.
Example:
string val = "209395.38800000001";
Console.WriteLine(float.Parse(val)); // <= float.Parse will deduce a different value: 209395.4
Update:
Here's how the value is saved in the internal XML.
Try it for yourself:
Make an .xlsx file with the value 209395.388 -> change the extension to .zip -> unzip it -> go to the worksheets folder -> open sheet1.xml
You will notice that the value is stored as 209395.38800000001, as seen in the attached image. So there is nothing wrong with the API extracting the stored number; it's your duty to decide what format to apply.
But if you make the whole column Text before adding the data, you will see that the .xlsx holds the data as it is; simply said, as a string.
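If it helps, here is a rough sketch of the kind of check I mean (not code from the question; the method name GetCellText is my own). It resolves shared strings for true Text cells and re-parses numeric cells so that the stored 209395.38800000001 prints as 209395.388 again. It assumes the Open XML SDK 2.5 and that you already have the WorkbookPart and the Cell in hand:
using System.Globalization;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Spreadsheet;

// Sketch only: drop this into your importer class and adjust as needed.
static string GetCellText(WorkbookPart workbookPart, Cell cell)
{
    string raw = cell.CellValue == null ? string.Empty : cell.CellValue.InnerText;

    // True Text cells are stored as an index into the shared string table.
    if (cell.DataType != null && cell.DataType.Value == CellValues.SharedString)
    {
        var sharedStrings = workbookPart.SharedStringTablePart.SharedStringTable;
        return sharedStrings.ElementAt(int.Parse(raw)).InnerText;
    }

    // Numeric cells: parse the stored double and let .NET print it,
    // so "209395.38800000001" comes back out as "209395.388".
    double number;
    if (double.TryParse(raw, NumberStyles.Float, CultureInfo.InvariantCulture, out number))
        return number.ToString(CultureInfo.InvariantCulture);

    return raw;
}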

Related

Using the toInteger function with locale and format parameters

I've got a dataflow with a csv file as source. The column NewPositive is a string and it contains numbers formatted in European style with a dot as thousands separator, e.g. 1.019 meaning 1019.
If I use the function toInteger to convert my NewPositive column to an int via toInteger(NewPositive,'#.###','de'), I only get the thousands digit, e.g. 1 for 1.019, and not the rest. Why? For testing I tried creating a constant column: toInteger('1.019','#.###','de') and it gives 1019 as expected. So why does the function not work for my column? The column is trimmed, and if I compare the first value with the equality function, equals('1.019',NewPositive) returns true.
Please note: I know it's very easy to create a workaround by toInteger(replace(NewPositive,'.','')), but I want to learn how to use the toInteger function with the locale and format parameters.
Here is sample data:
Dato;NewPositive
2021-08-20;1.234
2021-08-21;1.789
I was able to repro this and it looks to be a bug to me. I have reported this to the ADF team and will let you know once I hear back from them. You already have a workaround, so please go ahead with that to unblock yourself.

How to read a CSV file in Scala

I have a CSV file and I want to read that file and store it in a case class. As I know, a CSV is a comma-separated values file. But in the case of my CSV file there is some data which already contains commas itself, and that creates a new column for every comma. So the problem is how to split the data correctly.
1st data
04/20/2021 16:20(1st column) Here a bunch of basic techniques that suit most businesses, and easy-to-follow steps that can help you create a strategy for your social media marketing goals.(2nd column)
2nd data
11-07-2021 12:15(1st column) Focus on attracting real followers who are genuinely interested in your content, and make the most of your social media marketing efforts.(2nd column)
var i = 0
var length = 0
val data = Source.fromFile(file)
for (line <- data.getLines) {
  val cols = line.split(",").map(_.trim)
  length = cols.length
  while (i < length) {
    //println(cols(i))
    i = i + 1
  }
  i = 0
}
If you are reading a complex CSV file then the ideal solution is to use an existing library. Here is a link to the ScalaDex search results for CSV.
ScalaDex CSV Search
However, based on the comments, it appears that you might actually be wanting to read data stored in a Google Sheet. If that is the case, you can utilize the fact that you have some flexibility to save the data in a text file yourself. When I want to read data from a Google Sheet in Scala, the approach I use first is to save the file in a format that isn't hard to read. If the fields have embedded commas but no tabs, which is common, then I will save the file as a TSV and parse that with split("\t").
A simple bit of code that only uses the standard library might look like the following:
val source = scala.io.Source.fromFile("data.tsv")
val data = source.getLines.map(_.split("\t")).toArray
source.close
After this, data will be an Array[Array[String]] with your data in it that you can process as you desire.
Of course, if your data includes both tabs and commas then you'll really want to use one of those more robust external libraries.
You could use the univocity CSV parser for faster parsing.
You can also use it for writing CSV files.
Univocity parsers
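A minimal sketch of that, assuming the univocity-parsers dependency is on the classpath and Scala 2.13 (the file name is just an example):
import java.io.File
import scala.jdk.CollectionConverters._
import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}

// Quoted fields may contain commas, so no manual split(",") is needed.
val settings = new CsvParserSettings()
settings.setLineSeparatorDetectionEnabled(true)
val parser = new CsvParser(settings)

// parseAll returns a java.util.List[Array[String]], one Array per row.
val rows = parser.parseAll(new File("data.csv")).asScala
for (row <- rows) println(row.mkString(" | "))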

How do I prevent users from using the thousands separator in FileMaker Pro?

In FileMaker Pro, when using a number field, the user can choose to use a thousands separator or not. For example, if I have a database with a field for the price of an item, the user can enter either 1,000 or 1000.
I am using my database to generate an XML file that needs to be uploaded. The thing is that my XML schema dictates that only a value of 1000 is allowed and not 1,000. Therefore, I want to either automatically remove the comma, or (my preference in this case) alert the user when trying to enter a value with a thousands separator.
What I tried is the following.
For the field, I am setting Validation options. For example:
Require Strict data type: Numeric Only
Validated by calculation: Position ( Self ; "," ; 1 ; 1 ) = 0
Validated by calculation: Self = Substitute ( Self ; "," ; "" )
Auto-enter calculation: Filter( Self ; "0123456789." )
Unfortunately, none of these work. As the field is defined as a number (and I want to keep it like this, as I am also performing calculations based on this number), the Position and Substitute functions apparently ignore the thousands separator!
EDIT:
Note that I am generating my XML by concatenating a string, for example:
"<Products><Product><Name>" & Name & "</Name><Price>" & Price & "</Price></Product></Product>"
The reason is that what I am exporting is dependent on the values in my database. Therefore, I am not using the [File][Export records...] function.
Auto-enter calculation will work, but you need to uncheck the box "Do not replace existing value of field" (which is checked by default).
I'd suggest using the calculation GetAsNumber ( Self ) as the auto-enter calc. If it should only contain integers, wrap that in a call to Int().
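For example, the auto-enter calculation could be as simple as (a sketch, adjust to your field):
Int ( GetAsNumber ( Self ) )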
I am using my database to generate an XML file that needs to be uploaded. The thing is that my XML schema dictates that only a value of 1000 is allowed and not 1,000.
If this is only a problem when you export, why not handle it when exporting?
If you are exporting as XML using XSLT, you can add an instruction to
your stylesheet to remove the comma from all number fields;
Alternatively, you can export from a layout where the field is formatted to display without the comma and select the Apply current layout's data formatting to exported data option when exporting.
Added:
Perhaps I should have clarified. I am not using the export function to generate the XML as there is some logic involved in how the XML should be formatted (dependent on the data that I want to export). What I do instead is that I make a string where I combine XML-tags and actual values from the database.
IMHO, you're making a mistake by not taking advantage of the built-in XML/XSLT export option. Any imaginable logic can be implemented this way, without burdening your solution with the fragile task of creating a valid XML.
In any case, if you're using the field in a calculation, you can replace all references to it with:
GetAsNumber ( YourField )
to get an unformatted, numeric-only, value.
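Applied to the concatenation from the question, that would look something like this (a sketch using the field names from the question):
"<Products><Product><Name>" & Name & "</Name><Price>" & GetAsNumber ( Price ) & "</Price></Product></Products>"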
Your question puzzles me. As far as I know, FileMaker does not store the thousands separator, but rather offers it only as a display option.
That's also why those functions can't find it.
Are you sure you are exporting the raw data and not a "formatted as layout" variant?

Matlab save sequence of mat files from convertTDMS stored in cell array to sequence of mat files

I have data stored in the .tdms format, gathering the data of many sensors, measured every second, every day. A new tdms file is created every day and stored in a folder per month. Using the convertTDMS function, I have converted these tdms files to mat files.
As there are some errors in some of the measurements (e.g. negative values which cannot physically occur), I have performed some corrections by loading one mat file at a time, doing the calculations and then saving the data into the original .mat file.
However, when I try to do what I described above in a loop (so: load a .mat in the folder, do the calculations on one mat file (or channel therein), save the mat file, repeat until all files in the folder have been done), I run into trouble with the limitations of the save function: so far I save all variables in the workspace (or am unable to save at all) when using the code below.
for k = 1:nFiles
    w{k,1} = load(wMAT{k,1});
    len = length(w{k,1}.(x).(y).(z));
    pos = find(w{k,1}.(x).(y).(z)(1,len).(y)<0); %Wind speed must be >0 m/s
    for n = 1:length(pos)
        w{k,1}.(x).(y).(z)(1,len).(y)(pos(n)) = mean([w{k,1}.(x).(y).(z)(1,len).(y)(pos(n)+1),...
            w{k,1}.(x).(y).(z)(1,len).(y)(pos(n)-1)],2);
    end
    save( name{k,1});
    %save(wMAT{k,1},w{k,1}.(x),w{k,1}.ConvertVer,w{k,1}.ChanNames);
end
A bit of background information: the file names are stored in a cell array wMAT of length nFiles in the folder. Each cell in the cell array wMAT stores the fullfile path to the mat files.
The data of the files is loaded and saved into the cell array w, also of length nFiles.
Each cell in "w" has all the data stored from the tdms to mat conversion, in the format described in the convertTDMS description.
This means: to get at the actual data, I need to go from the
cell in the cell array w{k,1} (my addition)
to the struct array "ConvertedData" (Structure of all of the data objects - part of convertTDMS)
to the struct array below called "Data" (convertTDMS)
to the struct array below called "MeasuredData" (convertTDMS) -> at this level, I can access the channels which store the data.
to finally access/manipulate the values stored, I have to select a channel, e.g. (1,len), and then go via the struct array to the actual values (="Data"). (convertTDMS)
In Matlab format, this looks like "w{1, 1}.ConvertedData.Data.MeasuredData(1, len).Data(1:end)" or "w{1, 1}.ConvertedData.Data.MeasuredData(1, len).Data".
To make typing easier, I took
x = 'ConvertedData';
y = 'Data';
z = 'MeasuredData';
allowing me to write instead:
w{k,1}.(x).(y).(z)(1,len).(y)
using the dot notation.
My goal/question: I want to load the values stored in a .mat file from the original .tdms files in a loop to a cell array (or if I can do better than a cell array: please tell me), do the necessary calculations, and then save each 'corrected' .mat file using the original name.
So far, I have gotten a multitude of errors from trying a variety of solutions, going from "getfieldnames", trying to pass the name of the (dynamically changing) variable(s), etc.
Similar questions which have helped me get in the right direction include Saving matlab files with a name having a variable input, Dynamically Assign Variables in Matlab and http://www.mathworks.com/matlabcentral/answers/4042-load-files-containing-part-of-a-string-in-the-file-name-and-the-load-that-file , yet the result is that I am still no closer than doing manual labour in this case.
Any help would be appreciated.
If I understand your ultimate goal correctly, I think you're pretty much there. I think you're trying to process your .mat files and that the loading of all of the files into a cell array is not a requirement, but just part of your solution? Assuming this is the case, you could just load the data from one file, process it, save it and then repeat. This way you only ever have one file loaded at a time and shouldn't hit any limits.
Edit
You could certainly make a function out of your code and then call that in a loop, passing in the file name to modify. Personally I'd probably do that, as I think it's a neater solution. If you don't want to do that though, you could just replace w{k,1} with w; then each time you load a file, w would be overwritten. If you wanted to explicitly clear variables you can use the clear command with a space-separated list of variables, e.g. clear w len pos, but I don't think that this is necessary.
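A rough sketch of that one-file-at-a-time loop (untested; it reuses the x, y, z field names and the wMAT file list from your question):
for k = 1:nFiles
    s = load(wMAT{k,1});                        % load one file into a scalar struct
    len = length(s.(x).(y).(z));
    pos = find(s.(x).(y).(z)(1,len).(y) < 0);   % wind speed must be > 0 m/s
    for n = 1:length(pos)
        s.(x).(y).(z)(1,len).(y)(pos(n)) = mean([s.(x).(y).(z)(1,len).(y)(pos(n)+1), ...
            s.(x).(y).(z)(1,len).(y)(pos(n)-1)], 2);
    end
    save(wMAT{k,1}, '-struct', 's');            % write the fields of s back to the same file
end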

Search for all rows with values in Excel with Perl

I am trying to parse data from Excel using Perl into a specific XML format, which is read by another application, which then creates a graphical representation of the data. The Excel sheet gets updated by data capturers on a regular basis. I have opted for the Spreadsheet::Read module, but I am having an issue.
The sheet is divided into specific cells, which correspond to a specific layout.
The sheet will thus have:
Country | City | Suburb | link
Each row contains different data, etc.
I tried to tell the script to get every row like this:
use Spreadsheet::Read;
my $book = ReadData ("Captured_input.xlsx");
my @rows = Spreadsheet::Read::rows ($book->[1]);
print "@rows\n";
This however prints ARRAY(and some hex data)
However, I want it to read each row and cell and return it like this:
Country1, City1, Suburb1,link1
Country2, City2, Suburb2,link2
It runs on a daily basis, so it should not read only the end of the file; it should read the entire sheet each time, so that if any changes were made it will republish the XML.
If I use it like this, it works and returns the data, but then I need to manually specify each row, and I cannot predict how many rows there will be in the future.
use Spreadsheet::Read;
my $book = ReadData ("Captured_input.xlsx");
my @row = Spreadsheet::Read::cellrow ($book->[1], 4);
print "@row\n";
Some input would be greatly appreciated.
read perlreftut.
read perldsc.
read the Spreadsheet::Read documentation.
The documentation shows quite well what data structure the $book will contain. Once you understand how references work (see the first two links), handling the data will be easy.
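Once that clicks, a loop over the whole sheet can look something like this (a sketch only, using the file name from your question):
use strict;
use warnings;
use Spreadsheet::Read;

my $book = ReadData ("Captured_input.xlsx");

# rows() returns one array reference per spreadsheet row.
my @rows = Spreadsheet::Read::rows ($book->[1]);
for my $row (@rows) {
    # Join the cells of this row; empty cells come back as undef.
    print join(", ", map { defined $_ ? $_ : "" } @$row), "\n";
}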