Finding the format of an arbitrary delimited text file in MATLAB

I have a file that looks like this in Notepad++
I can easily see the spaces (the orange dots) and tabs (the orange arrows). I can also right-click the file in MATLAB and import it in a variety of ways. The first problem is that the delimiters are not consistent. Each field seems to be a TAB followed by enough spaces to pad it to 6 characters...
The only way I know to read a file in is if you already know how it is delimited. In this case I would like to parse each line so that MATLAB has a 'token' for what goes where, e.g.:
Line1: Text Space Text Space Text Tab Space Space Text NEWLINE
(Notepad++ seems to know just fine so surely MATLAB can get this info too?).
Is this possible? Then it would be nice to use this information to save the imported data back out to a file with exactly the same formatting.
The data is below. For some reason, copying it into Notepad++ does not preserve the delimiters, so you will need to add the tabs yourself to make it look like the file in the screenshot.
Average Counts : 56.2
Time : 120
Thanks

If you use textscan, the default behaviour should probably suit your needs:
Within each row of data, the default field delimiter is white space. White space can be any combination of space (' '), backspace ('\b'), or tab ('\t') characters. If you do not specify a delimiter, textscan interprets repeated white-space characters as a single delimiter.
The output is a cell array, where each column is saved as a cell. So C{1} would contain the strings, C{2} the colons, and C{3} the values.
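Note that treating repeated whitespace as a single delimiter discards the exact mix of tabs and spaces, which the question wants to keep for writing the file back out. As a language-neutral illustration (written in Python, not MATLAB), the following sketch labels every character of a line as text, space, tab, or newline, and rebuilds the line from those tokens. The names tokenize and rebuild and the sample line are hypothetical; the real file's layout is an assumption here.

```python
import re

# One alternative per token kind. Spaces and tabs are matched one
# character at a time so the token list mirrors "Tab Space Space".
TOKEN_RE = re.compile(
    r"(?P<SPACE> )|(?P<TAB>\t)|(?P<NEWLINE>\r\n|\r|\n)|(?P<TEXT>[^ \t\r\n]+)"
)

def tokenize(line):
    """Return (kind, text) pairs that together cover every character."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(line)]

def rebuild(tokens):
    """Concatenate token texts to recover the original line exactly."""
    return "".join(text for _, text in tokens)

# Hypothetical line: a tab followed by two padding spaces before the colon.
sample = "Average Counts\t  : 56.2\n"
tokens = tokenize(sample)
kinds = [kind for kind, _ in tokens]
print(kinds)
```

Because the token texts are kept alongside their kinds, writing rebuild(tokens) back to a file reproduces the original formatting byte for byte, even after the TEXT values have been inspected.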

Related

Trouble rendering CSV data as an interactive table in GitHub

When viewed, any .csv file committed to a GitHub repository automatically renders as an interactive table, complete with headers and row numbering. By default, the first row is your header row. The tables were supposed to look nice as below:
However, there's an error happening in my tabular data, and despite indicating the error, I can't fix it:
I'm using a .csv file with a semicolon separator. Does anyone have an idea of what's happening?
According to the docs, GitHub can only render the table layout for .csv (comma-separated) and .tsv (tab-separated) files.
Using a semicolon as a separator isn't supported, at least not officially, and a spurious comma in a semicolon-separated file could well throw the algorithm off.
You could try replacing all semicolons with tabs and see how you fare.
If that doesn't work, try using commas as separators and enclose all text table cell data with quotes, like:
"Liver fibrosis, sclerosis, and cirrhosis","c370800","102922","Cystic fibrosis related cirrhosis","Diagnosis of liver fibrosis, sclerosis, and cirrhosis"
Note: no spaces after the commas. Also, if you have quotes in the text fields, you will have to escape those to "" (two quotes), or the algorithm will get confused.
You may get away with using quotes only for the offending text data, but that could well be more difficult to generate than just putting the quotes around all fields.
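The conversion suggested above (commas as separators, every field quoted, embedded quotes doubled) can be sketched with Python's standard csv module. This is only an illustration under assumptions: the function name and the sample row are hypothetical, and the input is assumed to contain no quoted fields with embedded semicolons.

```python
import csv
import io

def semicolons_to_quoted_commas(src, dst):
    """Copy rows from a ';'-separated stream to a ','-separated one,
    quoting every field. The writer doubles embedded quotes itself."""
    reader = csv.reader(src, delimiter=";")
    writer = csv.writer(dst, delimiter=",", quoting=csv.QUOTE_ALL,
                        lineterminator="\n")
    for row in reader:
        writer.writerow(row)

# Hypothetical row containing commas and quotes inside a text field.
src = io.StringIO('Liver fibrosis, sclerosis, and cirrhosis;c370800;say "hi"\n')
dst = io.StringIO()
semicolons_to_quoted_commas(src, dst)
print(dst.getvalue())
```

For real files, the same function can be given file objects opened with newline="" instead of the in-memory streams used here.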

How can I remove trailing whitespace while loading data from a CSV file into a Postgres table?

I want to remove the trailing whitespace from a CSV file.
Sample CSV file data (delimiter=";"):
X ;Y;Z
X1 ; Y1;Z1
X2;Y2; Z2
I would have gone for something like sed or grep, but the file is huge, so preprocessing may hurt performance.
I am looking for a way to remove this whitespace at load time only.
The COPY command does not support preprocessing, so you can't do it "at the time of loading":
https://www.postgresql.org/docs/current/static/sql-copy.html
In CSV format, all characters are significant. A quoted value
surrounded by white space, or any characters other than DELIMITER,
will include those characters. This can cause errors if you import
data from a system that pads CSV lines with white space out to some
fixed width. If such a situation arises you might need to preprocess
the CSV file to remove the trailing white space, before importing the
data into PostgreSQL.
I think the best solution here would be to import the data with the spaces and then run:
update t set attr = rtrim(attr);
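If preprocessing turns out to be acceptable after all, it does not have to load the whole file into memory. The sketch below (Python, for illustration only) strips trailing whitespace from each field one line at a time. The function name is hypothetical, and it assumes no quoted fields that contain the delimiter. Leading spaces, as in " Y1", are left alone because the question asks only about trailing whitespace.

```python
import io

def strip_fields(lines, delimiter=";"):
    """Yield each line with trailing whitespace removed from every field.
    Assumes fields never contain the delimiter inside quotes."""
    for line in lines:
        fields = line.rstrip("\r\n").split(delimiter)
        yield delimiter.join(field.rstrip() for field in fields) + "\n"

# The sample data from the question, held in memory for the demo.
raw = io.StringIO("X ;Y;Z\nX1 ; Y1;Z1\nX2;Y2; Z2\n")
cleaned = "".join(strip_fields(raw))
print(cleaned)
```

Since the generator works line by line, its output could be streamed into COPY ... FROM STDIN rather than written to a second file, though whether that beats the import-then-update approach on a huge table would need measuring.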

TinyMCE converting space to &nbsp;

I am using TinyMCE 4. If I insert a space in the textarea between two words or characters and then check the source, the space has been converted to &nbsp;.
I have tried this solution, but it only resolves the issue partially. If I enter a single space between two characters or words, TinyMCE doesn't add &nbsp;, but if I add two consecutive spaces, it turns the second space into &nbsp;.
Any work around on this?
TinyMCE adds hard spaces (&nbsp;) when you type multiple spaces into the editor. HTML does not show multiple normal whitespace characters, so you can't get (per your example) two spaces between letters with just regular spaces. Using hard spaces for every other space allows content authors to use spaces within content and get a rendered result that matches what they type in the editor.
If you rendered that HTML without hard spaces, there would be just one space between each set of characters, regardless of how many spaces you put in the HTML source.
The upshot is that the editor is doing what it needs to do to allow you to see multiple spaces.
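The effect described above can be approximated in a few lines. This Python sketch is not TinyMCE's actual algorithm; it only illustrates, under that assumption, why alternating ordinary and no-break spaces survives HTML whitespace collapsing. Both function names are hypothetical.

```python
import re

NBSP = "\u00a0"  # the character that the &nbsp; entity stands for

def browser_collapse(text):
    """Approximate HTML rendering: a run of ordinary whitespace
    displays as a single space. No-break spaces are not collapsed."""
    return re.sub(r"[ \t\r\n]+", " ", text)

def protect_spaces(text):
    """Replace every second space in a run of spaces with NBSP."""
    def alternate(match):
        run = match.group()
        return "".join(" " if i % 2 == 0 else NBSP for i in range(len(run)))
    return re.sub(r" {2,}", alternate, text)

typed = "a  b   c"
protected = protect_spaces(typed)
print(repr(browser_collapse(typed)))      # extra spaces are lost
print(repr(browser_collapse(protected)))  # width is preserved
```

A single space is left untouched, which matches the behaviour reported in the question: only the second of two consecutive spaces becomes a hard space.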

Unicode converted text isn't shown properly in MS-Word

In a mapping editor, the display is correct after the legacy-to-Unicode conversion for Devanagari text shown using a Unicode font (Arial Unicode MS). However, in MS Word, the display isn't as expected for the same Unicode text in the same font (Arial Unicode MS) or in any other Devanagari Unicode font. The expected sequence of Unicode code points is provided as per the documentation. The sequence can be seen in the left-hand table.
Please let me know where I am going wrong.
Thanks for your help!
Does your map have to insert the zero_width_joiner? The halant (virama) by itself is enough to get the half-consonant (for some combinations) and in particular, it may be that Word is using the presence of the ZWJ to keep them separate.
If getting rid of the ZWJ doesn't help, another possibility is that Word may be treating the individual characters of the text string as individual "runs" of text.
If those first 4 characters are not in a single run, this can happen.
[Aside: the way to tell if it's being treated as a single run is to save the document as an XML file, open it with something like Notepad++, and look at the XML "w:t" element (IIRC) associated with these characters. If they're all in separate w:t elements, it means they're in separate runs. In that case, you might need to copy the text from Word to some other tool (e.g. Notepad++), then copy it from there and paste it back into Word. That might cause it to be imported into Word as a single run.]
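The manual check in the aside can also be scripted. The sketch below (Python, for illustration) lists the text of each w:t element in a WordprocessingML document, so characters that ended up in separate runs show up as separate list items. The function name and the inline sample XML are hypothetical; the namespace URI is the standard WordprocessingML one.

```python
import xml.etree.ElementTree as ET

# Standard WordprocessingML namespace that the "w:" prefix maps to.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def text_runs(document_xml):
    """Return the text of every w:t element, in document order."""
    root = ET.fromstring(document_xml)
    return [t.text or "" for t in root.iter(f"{{{W_NS}}}t")]

# Hypothetical fragment: KA + VIRAMA in one run, SSA in a second run.
sample = (
    f'<w:document xmlns:w="{W_NS}"><w:body><w:p>'
    '<w:r><w:t>\u0915\u094d</w:t></w:r>'
    '<w:r><w:t>\u0937</w:t></w:r>'
    '</w:p></w:body></w:document>'
)
runs = text_runs(sample)
print(len(runs), runs)
```

For a .docx file, the XML to pass in is the word/document.xml member of the zip archive, which zipfile.ZipFile(path).read("word/document.xml") returns as bytes that ET.fromstring accepts directly.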

JasperReports: Prevent textField to split on space or hyphen

I have a JasperReports template that contains a textField element that will hold strings of variable length. If the string is too long to fit the width, it is split, which is fine except when the string contains a hyphen or a space character. In that case the string is split at that character. Below are some examples of the input, the observed outcome, and the wanted outcome, plus a summary to make my point easier to understand.
input        observed    wanted    summary
__________________________________________
overflow  -> overfl   -> overfl  : OK
over flow -> over     -> over f  : NOT OK
over-flow -> over-    -> over-f  : NOT OK
Setting the reportElement to have isStretchWithOverflow="true" attribute will split the text on two lines, but this is not wanted behaviour.
Is there any way to fix this?
Thanks.
EDIT: The input data comes from an external source, so I cannot change it directly. I ran some tests and noticed that a non-breaking space works for spaces. The non-breaking hyphen, on the other hand, is not printed at all, i.e. the text 'over-flow' becomes 'overflow'. Not quite what is wanted.
Although the input source is not in my control, I could fix this problem by writing a Scriptlet that changes spaces to non-breaking spaces and hyphens to non-breaking hyphens, if only those darn non-breaking hyphens were printed.
Printing to PDF by the way, in case that gives some hints of the problem.
In the text field's ''Expression'', enter something like this:
String.join("\uFEFF", $F{field1}.split("(?!^)"))
It's a font issue. The non-breaking hyphen (\u2011) works fine when the font supports it. See the fonts sample for how to load a font other than the default (DejaVu Sans, for example).
If anyone has a better option for the input source modification than the Scriptlet, please let me know.
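The two string transformations discussed in this thread are small enough to sketch. The code below is Python for illustration only (a real Scriptlet or field expression would be Java), and the function names are hypothetical. The first function is the space and hyphen replacement described for the Scriptlet; the second roughly mirrors the String.join expression by placing U+FEFF between all characters.

```python
NBSP = "\u00a0"       # NO-BREAK SPACE
NB_HYPHEN = "\u2011"  # NON-BREAKING HYPHEN, needs a font that has the glyph
ZWNBSP = "\ufeff"     # ZERO WIDTH NO-BREAK SPACE, as used in the expression

def make_unbreakable(text):
    """Swap ordinary spaces and hyphens for their non-breaking forms."""
    return text.replace(" ", NBSP).replace("-", NB_HYPHEN)

def join_all_chars(text):
    # Rough equivalent of String.join("\uFEFF", text.split("(?!^)")):
    # put a zero-width no-break space between every pair of characters.
    return ZWNBSP.join(text)

print(ascii(make_unbreakable("over-flow now")))
print(ascii(join_all_chars("abc")))
```

The first approach changes only the two problem characters but depends on the font having a glyph for U+2011; the second leaves the visible characters alone and relies on the renderer honouring U+FEFF as a no-break hint.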
(Marking my own answer as correct in order to get this topic closed.)
EDIT: Have to wait for two days in order to mark this answer correct.