Search for all rows with values in excel with perl - perl

I am trying to parse data from excel using perl into a specific XML format, which is read by another application, which then creates a graphical representation of the data. The excel sheet gets updated by data capturers on a regular base. I have opted for the Spreadsheet::Read module but I am having an issue.
The sheet would be divided into specific cells, which corresponds to a specific layout.
The sheet will thus have:
Country | City | Suburb | link
Each row being different data etc.
I tried to tell the script to get every row like this.
use Spreadsheet::Read;
my $book = ReadData ("Captured_input.xlsx");
my #rows = Spreadsheet::Read::rows ($book->[1]);
print "#rows\n";
This however prints ARRAY(and some hex data)
However, I want it to read each row and cell and return it like this:
Country1, City1, Suburb1,link1
Country2, City2, Suburb2,link2
It runs on a daily base so it should not read end of file only, it should read the entire sheet each time, so if any changes were made it will republish the XML.
If I use it like this, it works and returns the data, but I then need to manually specify each row, but I cannot estimate how many rows there will be in future.
use Spreadsheet::Read;
my $book = ReadData ("Captured_input.xlsx");
my #row = Spreadsheet::Read::cellrow ($book->[1], 4);
print "#row\n,";
Some input would be greatly appreciated.

read perlreftut.
read perldsc.
read the Spreadsheet::Read documentation.
The documentation shows quite well what data structure the $book will contain. Once you understand how references work (see the first two links), handling the data will be easy.

Related

how to read CSV file in scala

I have a CSV file and I want to read that file and store it in case class. As I know A CSV is a comma separated values file. But in case of my csv file there are some data which have already comma itself. and it creates new column for every comma. So the problem how to split data from that.
1st data
04/20/2021 16:20(1st column) Here a bunch of basic techniques that suit most businesses, and easy-to-follow steps that can help you create a strategy for your social media marketing goals.(2nd column)
2nd data
11-07-2021 12:15(1st column) Focus on attracting real followers who are genuinely interested in your content, and make the most of your social media marketing efforts.(2nd column)
var i=0
var length=0
val data=Source.fromFile(file)
for (line <- data.getLines) {
val cols = line.split(",").map(_.trim)
length = cols.length
while(i<length){
//println(cols(i))
i=i+1
}
i=0
}
If you are reading a complex CSV file then the ideal solution is to use an existing library. Here is a link to the ScalaDex search results for CSV.
ScalaDex CSV Search
However, based on the comments, it appears that you might actually be wanting to read data stored in a Google Sheet. If that is the case, you can utilize the fact that you have some flexibility to save the data in a text file yourself. When I want to read data from a Google Sheet in Scala, the approach I use first is to save the file in a format that isn't hard to read. If the fields have embedded commas but no tabs, which is common, then I will save the file as a TSV and parse that with split("\t").
A simple bit of code that only uses the standard library might look like the following:
val source = scala.io.Source.fromFile("data.tsv")
val data = source.getLines.map(_.split("\t")).toArray
source.close
After this, data will be an Array[Array[String]] with your data in it that you can process as you desire.
Of course, if your data includes both tabs and commas then you'll really want to use one of those more robust external libraries.
You could use univocity CSV parser for faster stuffs.
You can also use it for creation as well.
Univocity parsers

XML::Twig parsing same name tag in same path

I am trying to help out a client who was unhappy with an EMR (Electronic Medical Records) system and wanted to switch but the company said they couldn't extract patient demographic data from the database (we asked if they can get us name, address, dob in a csv file of some sort, very basic stuff) - yet they claim they couldn't do that. (crazy considering they are using a sql database).
Anyway - the way they handed over the patients were in xml files and there are about 40'000+ of them. But they contain a lot more than the demographics.
After doing some research and having done extensive Perl programming 15 years ago (I admit it got rusty over the years) - I thought this should be a good task to get done in Perl - and I came across the XML::Twig module which seems to be able to do the trick.
Unfortunately the xml code that is of interest looks like this:
<==snip==>
<patient extension="Patient ID Number"> // <--Patient ID is 5 digit number)
<name>
<family>Patient Family name</family>
<given>Patient First/Given name</given>
<given>Patient Middle Initial</given>
</name>
<birthTime value=YEARMMDD"/>
more fields for address etc.are following in the xml file.
<==snip==>
Here is what I coded:
my $twig=XML::Twig->new( twig_handlers => {
'patient/name/family' => \&get_family_name,
'patient/name/given' => \&get_given_name
});
$twig->parsefile('test.xml');
my #fields;
sub get_family_name {my($twig,$data)=#_;$fields[0]=$data->text;$twig->purge;}
sub get_given_name {my($twig,$data)=#_;$fields[1]=$data->text;$twig->purge;}
I have no problems reading out all the information that have unique tags (family, city, zip code, etc.) but XML:Twig only returns the middle initial for the tag.
How can I address the first occurrence of "given" and assign it to $fields[1] and the second occurrence of "given" to $fields[2] for instance - or chuck the middle initial.
Also how do I extract the "Patient ID" or the "birthTime" value with XML::Twig - I couldn't find a reference to that.
I tried using $data->findvalue('birthTime') but that came back empty.
I looked at: Perl, XML::Twig, how to reading field with the same tag which was very helpful but since the duplicate tags are in the same path it is different and I can't seem to find an answer. Does XML::Twig only return the last value found when finding a match while parsing a file? Is there a way to extract all occurrences of a value?
Thank you for your help in advance!
It is very easy to assume from the documentation that you're supposed to use callbacks for everything. But it's just as valid to parse the whole document and interrogate it in its entirety, especially if the data size is small
It's unclear from your question whether each patient has a separate XML file to themselves, and you don't show what encloses the patient elements, but I suggest that you use a compromise approach and write a handler for just the patient elements which extracts all of the information required
I've chosen to build a hash of information %patient out of each patient element and push it onto an array #patients that contains all the data in the file. If you have only one patient per file then this will need to be changed
I've resolved the problem with the name/given elements by fetching all of them and joining them into a single string with intervening spaces. I hope that's suitable
This is completely untested as I have only a tablet to hand at present, so beware. It does stand a chance of compiling, but I would be surprised if it has no bugs
use strict;
use warnings 'all';
use XML::Twig;
my #patients;
my $twig = XML::Twig->new(
twig_handlers => { patient => \&get_patient }
);
$twig->parsefile('test.xml');
sub get_patient {
my ($twig, $pat) = #_;
my %patient;
$patient{id} = $pat>att('extension');
my $name = $pat->first_child('name');yy
$patient{family} = $name->first_child_trimmed_text('family');
$patient{given} = join ' ', $name->children_trimmed_text('given');
$patient{dob} = $pat->first_child('birthTime')->att('value');
push #patients, \%patient;
}

Exporting the output of MATLAB's methodsview

MATLAB's methodsview tool is handy when exploring the API provided by external classes (Java, COM, etc.). Below is an example of how this function works:
myApp = actxserver('Excel.Application');
methodsview(myApp)
I want to keep the information in this window for future reference, by exporting it to a table, a cell array of strings, a .csv or another similar format, preferably without using external tools.
Some things I tried:
This window allows selecting one line at a time and doing "Ctrl+c Ctrl+v" on it, which results in a tab-separated text that looks like this:
Variant GetCustomListContents (handle, int32)
Such a strategy can work when there are only several methods, but not viable for (the usually-encountered) long lists.
I could not find a way to access the table data via the figure handle (w/o using external tools like findjobj or uiinspect), as findall(0,'Type','Figure') "do not see" the methodsview window/figure at all.
My MATLAB version is R2015a.
Fortunately, methodsview.m file is accessible and allows to get some insight on how the function works. Inside is the following comment:
%// Internal use only: option is optional and if present and equal to
%// 'noUI' this function returns methods information without displaying
%// the table. `
After some trial and error, I saw that the following works:
[titles,data] = methodsview(myApp,'noui');
... and returns two arrays of type java.lang.String[][].
From there I found a couple of ways to present the data in a meaningful way:
Table:
dataTable = cell2table(cell(data));
dataTable.Properties.VariableNames = matlab.lang.makeValidName(cell(titles));
Cell array:
dataCell = [cell(titles).'; cell(data)];
Important note: In the table case, the "Return Type" column title gets renamed to ReturnType, since table titles have to be valid MATLAB identifiers, as mentioned in the docs.

Why does Open XML API Import Text Formatted Column Cell Rows Differently For Every Row

I am working on an ingestion feature that will take a strongly formatted .xlsx file and import the records to a temp storage table and then process the rows to create db records.
One of the columns is strictly formatted as "Text" but it seems like the Open XML API handles the columns cells differently on a row-by-row basis. Some of the values while appearing to be numeric values are truly not (which is why we format the column as Text) -
some examples are "211377", "211727.01", "209395.388", "209395.435"
what these values represent is not important but what happens is that some values (using the Open XML API v2.5 library) will be read in properly as text whether retrieved from the Shared Strings collection or simply from InnerXML property while others get sucked in as numbers with what appears to be appended rounding or precision.
For example the "211377", "211727.01" and "209395.435" all come in exactly as they are in the spreadsheet but the "209395.388" value is being pulled in as "209395.38800000001" (there are others that this happens to as well).
There seems to be no rhyme or reason to which values get messed up and which ones which import fine. What is really frustrating is that if I use the native Import feature in SQL Server Management Studio and ingest the same spreadsheet to a temp table this does not happen - so how is that the SSMS import can handle these values as purely text for all rows but the Open XML API cannot.
To begin the answer you main problem seems to be values,
"209395.388" value is being pulled in as "209395.38800000001"
Yes in .xlsx file value is stored as 209395.38800000001 instead of 209395.388. And it's the correct format to store floating point numbers; nothing wrong in it. You van simply confirm it by following code snippet
string val = "209395.38800000001"; // <= What we extract from Open Xml
Console.WriteLine(double.Parse(val)); // < = Simply pass it to double and print
The output is :
209395.388 // <= yes the expected value
So there's nothing wrong in the value you extract from .xlsx using Open Xml SDK.
Now to cells, yes cell can have verity of formats. Numbers, text, boleans or shared string text. And you can styles to a cell which would format your string to a desired output in Excel. (Ex - Date Time format, Forced strings etc.). And this the way Excel handle the vast verity of data. It need this kind of formatting and .xlsx file format had to be little complex to support all.
My advice is to use a proper parse method set at extracted values to identify what format it represent (For example to determine whether its a number or a text) and apply what type of parse.
ex : -
string val = "209395.38800000001";
Console.WriteLine(float.Parse(val)); // <= Float parse will be deduce a different value ; 209395.4
Update :
Here's how value is saved in internal XML
Try for yourself ;
Make an .xlsx file with value 209395.388 -> Change extention to .zip -> Unzip it -> goto worksheet folder -> open Sheet1
You will notice that value is stored as 209395.38800000001 as scene in attached image.. So nothing wrong on API for extracting stored number. It's your duty to decide what format to apply.
But if you make the whole column Text before adding data, you will see that .xlsx hold data as it is; simply said as string.

Extract portions of Google Docs form data and write it on separate lines

Let's say I have a Google Docs Form that gathers the following info:
Timestamp (default field)
Names
Ref#
The form data then appears on the spreadsheet as follows:
4/10/2013 16:20:31 | Jack, Jill, Oscar | Ref6656X
(Note: the number of names may be anywhere from 1 to many)
I need the data to appear on the spreadsheet as follows:
4/10/2013 16:20:31 | Jack | Ref6656X
4/10/2013 16:20:31 | Jill | Ref6656X
4/10/2013 16:20:31 | Oscar | Ref6656X
I can often decipher and edit Google Apps Script (JavaScript?), but I don't know how to think in that language in order to create it for myself (especially with an unknown number of names in the Name field). How can I get started on solving this?
First of all, you've got some choices to make before you start writing your code.
Do you want to modify the spreadsheet that's accepting form input, or produce a separate sheet that has the modified data? If you want to have a record of what was actually input by a user, you'd best leave the original data alone. If you're using a second sheet for the massaged output, the presence of multiple tabs might be confusing to your users, unless you take steps to hide it.
Do you want to do the modifications as forms come in, or (in bulk) at some point afterwards? If you already have collected data, you'll have to have the bulk processing, and that will involve looping and having to handle insertions of new rows in the middle of things. To handle forms as they come in, you'll need to set up a function that is triggered by form submissions, and only extend the table further down... but you've got more learning to do - see Container-Specific Triggers, Understanding Triggers and Understanding Events for background info.
Will you primarily use Spreadsheet service functions, or javascript Arrays? This choice is often about speed - the more you can do in javascript, the faster your script will be, but switching between the two can be confusing at first.
Here's an example function to do bulk processing. It reads all existing data into an array, goes through that and copies all rows into a new array, expanding multiple names into multiple rows. When done, the existing sheet data is overwritten. (Note - not debugged or tested.)
function bulkProcess() {
var ss = SpreadsheetApp.getActiveSpreadsheet();
var dataIn = ss.getDataRange().getValues();
var dataOut = [];
for (var row in dataIn) { // Could use: for (var row = 0; row < dataIn.length; row++)
var names = dataIn[row][1].split(','); // array of names in second column
var rowOut = dataIn[row];
for (var i in names) {
rowOut[1] = names[i]; // overwrite with single name
dataOut.push(rowOut); // then copy to dataOut array
}
}
// Write the updated array back to spreadsheet, overwriting existing values.
ss.getRange(1,1,dataOut.length,dataOut[0].length).setValues(dataOut);
}