I have a column of XML strings:
<Options TE="2017/09/01, 16:45:00.000" ST="2017/09/01, 09:00:00.000" TT="2017/09/01, 16:45:00.000"/>
<Options TE="2017/09/01, 16:45:00.000" ST="2017/09/01, 09:00:00.000" TT="2017/09/01, 16:45:00.000"/>
<Options TE="2017/09/04, 16:45:00.000" ST="2017/09/04, 09:00:00.000" TT="2017/09/04, 16:45:00.000"/>
that I am trying to split into columns
TE, ST, TT
The type of the data is C (char).
Not being very familiar with kdb/q, I tried a very manual approach. First, I removed the start and end tags:
x:update `$ssr[;"<Options";""] each tags from x
x:update `$ssr[;"/>";""] each string tags from x
leaving me with rows like
TE="2017/09/01, 16:45:00.000" ST="2017/09/01, 09:00:00.000" TT="2017/09/01, 16:45:00.000"
Then, splitting the string
select `$"\"" vs' string tags from x
gives me a list where the odd entries are my times. I just can't figure out how to take that list and split it into separate columns. Any ideas?
I've taken a slightly different approach but the following should do what you want:
//Clean the tags up for separation
//(get rid of open/close tags, change ", " to "," for ease of parsing and remove quote marks)
x:update tags:{ssr/[x;("<Options ";"/>";", ";"\"");("";"";",";"")]} each tags from x
//Parse the various tags using 0:, put the result into a dictionary,
//exec out to table form and add to x
x:x,'exec (!) ./: ("S= " 0:/: tags) from x
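For illustration, here is how the 0: key-value parse and dictionary steps behave on a single cleaned row (the row literal below is what one row should look like after the ssr step above):
/ single cleaned row, for illustration
row:"TE=2017/09/01,16:45:00.000 ST=2017/09/01,09:00:00.000 TT=2017/09/01,16:45:00.000"
kv:"S= " 0: row    / key-value parse: 2-row list of symbol keys and string values
d:(!) . kv         / dictionary `TE`ST`TT!("2017/09/01,16:45:00.000";...)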
For reference here's the table I used:
x:([] tags:("<Options TE=\"2017/09/01, 16:45:00.000\" ST=\"2017/09/01, 09:00:00.000\" TT=\"2017/09/01, 16:45:00.000\"/>";
"<Options TE=\"2017/09/01, 16:45:00.000\" ST=\"2017/09/01, 09:00:00.000\" TT=\"2017/09/01, 16:45:00.000\"/>";
"<Options TE=\"2017/09/04, 16:45:00.000\" ST=\"2017/09/04, 09:00:00.000\" TT=\"2017/09/04, 16:45:00.000\"/>"))
Crazy thought: is your XML data regular enough that you could select "columns" via indexing? If so, supposing the data (above) were in a 3-element list of strings, could you not apply some function foo to:
foo xmllist[;ind]
where ind selects the characters required. The function foo would do the necessary conversion to the timestamp datatype, perhaps by using (types;delimiter) 0: ...
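As a minimal sketch of that idea (the character offset and length below are assumptions based on the sample rows, and the conversion uses ssr with a "P"$ cast rather than 0:):
/ assumes every row has exactly the fixed layout shown above, so the TE value always starts at character 13
foo:{"P"$ssr[ssr[x;"/";"."];", ";"D"]}   / "2017/09/01, 16:45:00.000" -> 2017.09.01D16:45:00.000
te:foo each xmllist[;13+til 24]          / fixed-position "column" for TE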
See if you can export the XML file to a JSON file. kdb+/q has a JSON parser which does all the dirty work for you: .j.k (parse) and .j.j (serialize).
Reference: http://code.kx.com/q/cookbook/websockets/#json
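For example, a minimal sketch (the JSON fragment below is made up just to show the call):
/ parse a made-up JSON object into a q dictionary
d:.j.k "{\"TE\":\"2017/09/01, 16:45:00.000\",\"ST\":\"2017/09/01, 09:00:00.000\"}"
d`TE    / "2017/09/01, 16:45:00.000"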
I read in multiple sheets (6) from an xlsx file and created individual dataframes. I want to write each one out to a pipe-delimited csv.
ind_dim.to_csv (r'/mypath/ind_dim_out.csv', index = None, header=True, sep='|')
It currently outputs like this:
1|value1 |value2 |word1 word2 word3 etc.
I want to strip the trailing blanks.
Suggestion
Apply the method .apply(lambda x: x.str.rstrip()) to your DataFrame (prior to the .to_csv() call) to strip the trailing blank from each field across the DataFrame. It would look like:
Change:
ind_dim.to_csv(r'/mypath/ind_dim_out.csv', index = None, header=True, sep='|')
To:
ind_dim.apply(lambda x: x.str.rstrip()).to_csv(r'/mypath/ind_dim_out.csv', index = None, header=True, sep='|')
It can be easily inserted into the output line using '.' chaining. To handle multiple data types, we can enforce the 'object' (string) dtype on import by including the argument dtype='str':
ind_dim = pd.read_excel('testing_xlsx_nums.xlsx', header=0, index_col=0, sheet_name=None, dtype='str')
Or on the DataFrame itself by:
df = pd.DataFrame(df, dtype='str')
Proof
I did a mock-up where the .xlsx document has 5 sheets, each sheet having three columns: the first column with all numbers except an empty cell in row 2; the second column with both a leading blank and a trailing blank on strings, an empty cell in row 3, and a number in row 4; and the third column with all strings having a leading blank and an empty value in row 4. Integer indexes and integer columns have been included. The text in each sheet is:
       0        1        2
0  11111  valueB1  valueC1
1         valueB2  valueC2
2  33333           valueC3
3  44444    44444
4  55555  valueB5  valueC5
This code reads our .xlsx file testing_xlsx_nums.xlsx into the DataFrame dictionary ind_dim.
Next, it loops through each sheet, using the sheet name as the key to reference that sheet's DataFrame. It applies .str.rstrip() to the entire sheet/DataFrame by passing the lambda function lambda x: x.str.rstrip() to the .apply() method called on the sheet/DataFrame.
Finally, it outputs each sheet/DataFrame as a pipe-delimited .csv using .to_csv(), as in the OP's post.
# reads xlsx in
ind_dim = pd.read_excel('testing_xlsx_nums.xlsx', header=0, index_col=0, sheet_name=None, dtype='str')
# loops through sheets, applies rstrip(), output as csv '|' delimit
for sheet in ind_dim:
    ind_dim[sheet].apply(lambda x: x.str.rstrip()).to_csv(sheet + '_ind_dim_out.csv', sep='|')
Returns:
|0|1|2
0|11111| valueB1| valueC1
1|| valueB2| valueC2
2|33333|| valueC3
3|44444|44444|
4|55555| valueB5| valueC5
(Note our column 2 strings no longer have the trailing space).
We can also reference each sheet using a loop that cycles through the dictionary items; the syntax would look like for k, v in dict.items() where k and v are the key and value:
# reads xlsx in
ind_dim = pd.read_excel('testing_xlsx_nums.xlsx', header=0, index_col=0, sheet_name=None, dtype='str')
# loops through sheets, applies rstrip(), output as csv '|' delimit
for k, v in ind_dim.items():
    v.apply(lambda x: x.str.rstrip()).to_csv(k + '_ind_dim_out.csv', sep='|')
Notes:
We'll still need to apply the correct arguments for selecting/ignoring indexes and columns with the header= and names= parameters as needed. For these examples I just passed None for simplicity.
The other methods, which strip leading and leading & trailing spaces, are .str.lstrip() and .str.strip() respectively. They can also be applied to an entire DataFrame by passing lambda x: x.str.strip() to the .apply() method called on the DataFrame (a short sketch follows these notes).
Only 1 Column: If we only wanted to strip from one column, we can call the .str methods directly on the column itself. For example, to strip leading & trailing spaces from a column named column2 in DataFrame df we would write: df.column2.str.strip().
Data types not string: when importing our data, pandas will infer a data type for each column. We can override this by passing dtype='str' to the pd.read_excel() call when importing.
pandas 1.0.1 documentation (04/30/2020) on pandas.read_excel:
"dtype : Type name or dict of column -> type, default None
Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32} Use object to preserve data as stored in Excel and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion."
We can pass the argument dtype='str' when importing with pd.read_excel() (as seen above). If we want to enforce a single data type on a DataFrame we are already working with, we can set it equal to itself and pass it to pd.DataFrame() with the argument dtype='str', like: df = pd.DataFrame(df, dtype='str')
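As a small sketch of these notes (df, column1 and column2 are just illustrative names):
import pandas as pd
# illustrative frame with leading/trailing spaces, forced to string dtype
df = pd.DataFrame({'column1': [' 111 ', ' 222 '], 'column2': [' x ', ' y ']}, dtype='str')
df_stripped = df.apply(lambda x: x.str.strip())   # strip both sides across the whole DataFrame
col2_only = df.column2.str.strip()                # strip a single column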
Hope it helps!
The following trims left and right spaces fairly easily:
if (!require(dplyr)) {
install.packages("dplyr")
}
library(dplyr)
if (!require(stringr)) {
install.packages("stringr")
}
library(stringr)
setwd("~/wherever/you/need/to/get/data")
outputWithSpaces <- read.csv("CSVSpace.csv", header = FALSE)
print(head(outputWithSpaces), quote=TRUE)
#str_trim(string, side = c("both", "left", "right"))
outputWithoutSpaces <- outputWithSpaces %>% mutate_all(str_trim)
print(head(outputWithoutSpaces), quote=TRUE)
Starting Data:
V1 V2 V3 V4
1 "Something is interesting. " "This is also Interesting. " "Not " "Intereting "
2 " Something with leading space" " Leading" " Spaces with many words." " More."
3 " Leading and training Space. " " More " " Leading and trailing. " " Spaces. "
Resulting:
V1 V2 V3 V4
1 "Something is interesting." "This is also Interesting." "Not" "Intereting"
2 "Something with leading space" "Leading" "Spaces with many words." "More."
3 "Leading and training Space." "More" "Leading and trailing." "Spaces."
I have a series of DICOM images which I want to anonymize. I found a few Matlab scripts and some programs which do the job, but none of them export a .txt file of the removed personal information. I was wondering if there is a function which can also save the removed personal information of a DICOM image in .txt format for future use. Also, I am trying to create a table which maps each anonymized image ID to the subject's real name (subject's real name = personal-information-removed image ID).
Any thoughts?
Thanks for considering my request!
I'm guessing you only want to output to your text file the fields that are changed by anonymization (either modified, removed, or added). You may want to adjust some dicomanon options to reduce the number of changes, in particular passing the arguments 'WritePrivate', true to ensure private extensions are kept.
First, you can perform the anonymization, saving structures of pre- and post-anonymization metadata using dicominfo:
preAnonData = dicominfo('input_file.dcm');
dicomanon('input_file.dcm', 'output_file.dcm', 'WritePrivate', true);
postAnonData = dicominfo('output_file.dcm');
Then you can use fieldnames and setdiff to find fields that are removed or added by anonymization, and add them to the post-anonymization or pre-anonymization data, respectively, with a nan value as a placeholder:
preFields = fieldnames(preAnonData);
postFields = fieldnames(postAnonData);
removedFields = setdiff(preFields, postFields);
for iField = 1:numel(removedFields)
    postAnonData.(removedFields{iField}) = nan;
end
addedFields = setdiff(postFields, preFields);
for iField = 1:numel(addedFields)
    preAnonData.(addedFields{iField}) = nan;
end
It will also be helpful to use orderfields so that both data structures have the same ordering for their field names:
postAnonData = orderfields(postAnonData, preAnonData);
Finally, now that each structure has the same fields in the same order we can use struct2cell to convert their field data to a cell array and use cellfun and isequal to find any fields that have been modified by the anonymization:
allFields = fieldnames(preAnonData);
preAnonCell = struct2cell(preAnonData);
postAnonCell = struct2cell(postAnonData);
index = ~cellfun(@isequal, preAnonCell, postAnonCell);
modFields = allFields(index);
Now you can create a table of the changes like so:
T = table(modFields, preAnonCell(index), postAnonCell(index), ...
'VariableNames', {'Field', 'PreAnon', 'PostAnon'});
And you could use writetable to easily output the table data to a text file:
writetable(T, 'anonymized_data.txt');
Note, however, that if any of the fields in the table contain vectors or structures of data, the formatting of your output file may look a little funky (i.e. lots of columns, most of them empty, except for those few fields).
One way to do this is to store the tags before and after anonymisation and use these to write your text file. In Matlab, dicominfo() will read the tags into a structure:
% Get tags before anonymization
tags_before = dicominfo(file_in);
% Anoymize
dicomanon(file_in, file_out); % Need to set tags values where required
% Get tags after anonymization
tags_after = dicominfo(file_out);
% Do something with the two structures
disp(['Patient ID:', tags_before.PatientID ' -> ' tags_after.PatientID]);
disp(['Date of Birth:', tags_before.PatientBirthDate ' -> ' tags_after.PatientBirthDate]);
disp(['Family Name:', tags_before.PatientName.FamilyName ' -> ' tags_after.PatientName.FamilyName]);
You can then write out the before/after fields into a text file. You'd need to modify dicomanon() to choose your own values for the removed fields, since by default they are set to empty.
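A minimal sketch of that last step might look like this (the log file name and the choice of fields are just for illustration):
% write a few before/after fields to a text file
fid = fopen('anon_log.txt', 'w');
fprintf(fid, 'Patient ID: %s -> %s\n', tags_before.PatientID, tags_after.PatientID);
fprintf(fid, 'Birth date: %s -> %s\n', tags_before.PatientBirthDate, tags_after.PatientBirthDate);
fclose(fid);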
I have simulation data in an ASCII file with a lot of data points. I'm trying to extract the variable names and their values from it. Below is an example of what the file format looks like:
*ESA
*COM on Tue Sep 27 15:23:02 2016
*COM C:\Users\vi813c\Documents\My Matlab\
*COM The pathname to the ESB file was: C:\Users\vi813c\Documents\My Matlab
Case013
*RTITLE
Run Date/Time = 20-SEP-2016 13:29:00
MSC.EASY5 time-history plot with 20001 data points
*EOD
*FLOAT
TIME FDLB(1) FSLB(1) FVLB(1) MXLB(1) \
MYLB(1) MZLB(1) FDLB(2) FSLB(2) FVLB(2) \
MXLB(2) MYLB(2) MZLB(2) FDLB(3) FSLB(3) \
FVLB(3) MXLB(3) MYLB(3) MZLB(3)
0 884.439 -0 53645.8 -972.132
-311780 207.866 5403.68 1981.49 327781
258746 -1.74898E+006 84631.4 5384.25 -1308.47
326538 -97028.6 -1.74013E+006 -61858.1
0.002 882.616 0.008033 53661.1 -972.4
-311702 207.779 5400.42 1982.11 327784
258726 -1.74906E+006 84628.3 5381.01 -1308.44
326541 -97040.1 -1.74021E+006 -61858.8
0.004 876.819 0.031336 53705.6 -973.183
-311683 207.661 5391.19 1983.9 327795
258693 -1.74935E+006 84624 5371.85 -1309.63
326552 -97040.6 -1.74051E+006 -61858.8
0.006 869.491 0.061631 53763.3 -974.213
-311806 207.618 5377.45 1986.76 327813
258659 -1.74995E+006 84621.7 5358.2 -1312.04
326569 -97040.3 -1.7411E+006 -61861
0.008 861.718 0.095625 53828.1 -975.379
-312039 207.648 5360.82 1990.12 327834
A summary of data format characteristics is as follows:
Everything above "*FLOAT" is a header and I need to get rid of it
Stuff between "*FLOAT" and the first numeric value are the variable names
The variable names and the values are delimited by space(s) and '\'
The data are "lumped". Each lump has values for the variables at a given simulation time step. In the example above, there are 19 variables so that there are 19 numeric values in each lump
There can be multiple data sets; each preceded with "*FLOAT" and a variable name section
The following is how I am currently handling this data:
fileread the file --> one big string of characters
regexprep {'\s+','\','\n'} with ',' --> comma delimited for strsplit
strfind "*FLOAT"
strsplit by ',' --> now becomes a cell
find the first numeric value by isnan(str2double(parse))
Then between the index from 2. and the index from 4 are the variable names and between the index from 4 and the next "*FLOAT" are the numeric data
This scheme is sort of working, but I can't stop thinking that there's gotta be a better way to do this. For one, the step 1. is extremely slow. I guess it's one big string for regexprep to work on with multiple things to replace.
How can I improve my script?
I gave this a shot with the string class, which is new in R2016b.
str = string(fileread('file.txt'));
fileNewline = [13 newline]; % This data has carriage returns
str = extractAfter(str, ['*FLOAT' fileNewline]);
str = erase(str, ['\' fileNewline]);
str = splitlines(str);
% Get the variable names
varNames = split(str(1))';
% Get the data
data = reshape(str(2:end), 4, [])';
data = strip(data);
data = join(data);
data = split(data);
data = double(data);
I'm not sure about how to load the file faster.
As mentioned in another comment, textscan could probably help. It might end up being the fastest solution. With the correct format specified and using the 'HeaderLines' option, I think you can make it work.
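A rough sketch of the textscan idea, assuming a single data set of 19 columns and that the header plus the variable-name lines occupy the first 14 lines (as in the example file above):
% skip the header and variable-name lines, then read 19 floats per record
fid = fopen('file.txt');
C = textscan(fid, repmat('%f', 1, 19), 'HeaderLines', 14, 'CollectOutput', true);
fclose(fid);
data = C{1};   % N-by-19 numeric matrix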
I want to use fscanf for reading a text file containing 4 rows with an unknown number of columns. The newline is represented by two consecutive spaces.
It was suggested that I pass : as the sizeA parameter but it doesn't work.
How can I read in my data?
update: The file format is
String1 String2 String3
10 20 30
a b c
1 2 3
I have to fill 4 arrays, one for each row.
See if this will work for your application.
fid1=fopen('test.txt');
i=1;
check=0;
while check~=1
    str=fscanf(fid1,'%s',1);
    if strcmp(str,'')~=1
        string(i)={str};
    end
    i=i+1;
    check=strcmp(str,'');
end
fclose(fid1);
X=reshape(string,[],4);
ar1=X(:,1)
ar2=X(:,2)
ar3=X(:,3)
ar4=X(:,4)
Once you have 'ar1','ar2','ar3','ar4' you can parse them however you want.
I have found a solution; I don't know if it is the only one, but it works fine:
A=fscanf(fid,'%[^\n] *\n')
B=sscanf(A,'%c ')
Z=fscanf(fid,'%[^\n] *\n')
C=sscanf(Z,'%d')
....
You could use
rawText = fgetl(fid);
lines = regexp(rawText, '  ', 'split');   % two consecutive spaces mark a "newline" in this file
tokens = {};
for ix = 1:numel(lines)
    tokens{end+1} = regexp(lines{ix}, ' ', 'split');
end
This will give you a cell array of strings having the row and column shape of your original data.
This reads an arbitrary line of text and then breaks it up according to the formatting information you have available; my example uses two consecutive spaces for the row breaks and a single space character between fields. Regular expressions are powerful but too complex to describe here; see the MATLAB help for regexp and regular expressions.
I'm writing a program in Progress, OpenEdge, ABL, and whatever else it's known as.
I have a CSV file that is delimited by commas. However, there is a "gift message" field, and users enter messages containing commas, so my program sees extra entries because of those stray commas.
The CSV fields are not in double quotes, so I CANNOT just use my main method, which is:
/** this next block of code will remove all unwanted commas from the data. **/
if v-line-cnt > 1 then /** we won't run this against the headers, otherwise they will get deleted **/
assign
v-data = replace(v-data,'","',"\t") /** Here is a special technique to replace the comma delim wiht a tab **/
v-data = replace(v-data,','," ") /** now that we removed the comma delim above, we can remove all nuisance commas **/
v-data = replace(v-data,"\t",'","'). /** all nuisance commas are gone, we turn the tabs back to commas. **/
Any advice?
Edit:
From Progress, I can call Linux commands, so I should be able to execute C++/PHP/shell scripts etc. from my Progress program. I look forward to advice; until then I shall look into using external scripts.
You are not providing quite enough data for a perfect answer but given what you say I think the IMPORT statement should handle this automatically.
In my example here commaimport.csv is a comma-separated csv-file with quotes around text fields. Integers, logical variables etc have no quotes. The last field contains a comma in one line:
commaimport.csv
=======================
"Id1", 123, NO, "This is a message"
"Id2", 124, YES, "This is a another message, with a comma"
"Id3", 323, NO, "This is a another message without a comma"
To import this file I define a temp-table matching the file layout and use the IMPORT statement with comma as delimiter:
DEFINE TEMP-TABLE ttImport NO-UNDO
FIELD field1 AS CHARACTER FORMAT "xxx"
FIELD field2 AS INTEGER FORMAT "zz9"
FIELD field3 AS LOGICAL
FIELD field4 AS CHARACTER FORMAT "x(50)".
INPUT FROM VALUE("c:\temp\commaimport.csv").
REPEAT :
CREATE ttImport.
IMPORT DELIMITER "," ttImport.
END.
INPUT CLOSE.
FOR EACH ttImport:
DISPLAY ttImport.
END.
You don't have to import into a temp-table. You could import into variables instead.
DEFINE VARIABLE c AS CHARACTER NO-UNDO FORMAT "xxx".
DEFINE VARIABLE i AS INTEGER NO-UNDO FORMAT "zz9".
DEFINE VARIABLE l AS LOGICAL NO-UNDO.
DEFINE VARIABLE d AS CHARACTER NO-UNDO FORMAT "x(50)".
INPUT FROM VALUE("c:\temp\commaimport.csv").
REPEAT :
IMPORT DELIMITER "," c i l d.
DISP c i l d.
END.
INPUT CLOSE.
This will render basically the same output.
You don't show what your data file looks like. But if the problematic field is the last one, and there are no quotes, then your best bet is probably to read it using INPUT UNFORMATTED to get it a line at a time, and then split the line into fields using ENTRY(). That way you can treat everything after the nth comma as a single field no matter how many commas the line has.
For example, say your input file has three columns like this:
boris,14.23,12 the avenue
mark,32.10,flat 1, the grange
percy,1.00,Bleak house, Dartmouth
... so that column three is an address which might contain a comma and is not enclosed in quotes so that IMPORT DELIMITER can't help you.
Something like this would work in that case:
/* ...skipping a lot of definitions here ... */
input from "datafile.csv".
repeat:
    import unformatted v-line.
    create tt-thing.
    assign tt-thing.name    = entry(1, v-line, ',')
           tt-thing.price   = entry(2, v-line, ',')
           tt-thing.address = entry(3, v-line, ',').
    do v-i = 4 to num-entries(v-line, ','):
        tt-thing.address = tt-thing.address
                         + ','
                         + entry(v-i, v-line, ',').
    end.
end.
input close.