I am unable to parse date info from a csv file into ipython - date

I am running python 3.5, I have imported pandas. My csv file (payinfo.csv) looks like:
"01 DEC",1234.45,2344,11,1212.66
"01 NOV", 9898.33, 2343,12,1009.33
When I run the following:
dateparse = lambda x: pd.datetime.strptime(x,"%d %b")
pay_data = pd.read_csv('payinfo.csv', parse_dates = ['Date'], date_parse
I always get
"ValueError: time data '“01 DEC”' does not match format '%d %b'
I am a new programmer to python, and would appreciate any help.

I think it was just the double quotes around the string that caused that error. Try stripping away any hardcoded (not 'python generated') single or double quote marks with .strip('"').
Example:
a = '"01 DEC"'
# Gives error
#a = pd.datetime.strptime(a,"%d %b")
# string without unnecessary quote marks
a = pd.datetime.strptime(a.strip('"'),"%d %b")
print(a)
Output:
1900-12-01 00:00:00
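Note that the error message in the question shows curly quotes (“ ”) around the value, which .strip('"') alone would not remove. Here is a minimal sketch using only the standard library, assuming the field may be wrapped in either straight or curly quotes. The helper name parse_day_month and the QUOTES constant are made up for illustration.

```python
from datetime import datetime

# Quote characters that may surround the field: straight and curly
# (an assumption based on the error message shown in the question).
QUOTES = "\"'\u201c\u201d\u2018\u2019"


def parse_day_month(raw):
    """Strip whitespace and surrounding quotes, then parse 'DD MON'."""
    cleaned = raw.strip().strip(QUOTES)
    return datetime.strptime(cleaned, "%d %b")


parsed = parse_day_month("\u201c01 DEC\u201d")
print(parsed)  # the year defaults to 1900 because the format has no %Y
```

The same cleaning step could be applied to the column before handing it to a date parser.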

You haven't included the headers in the question. But this works:
import io
import pandas as pd
a = io.StringIO(u""""01 DEC",1234.45,2344,11,1212.66
"01 NOV", 9898.33, 2343,12,1009.33""")
dateparse = lambda x: pd.datetime.strptime(x,"%d %b")
df = pd.read_csv(a,header=None, parse_dates=[0], date_parser=dateparse)
print(df)
You can prepend a custom year to x before converting it to datetime:
.strptime(year + x,"%Y%d %b")
Output:
           0        1     2   3        4
0 1900-12-01  1234.45  2344  11  1212.66
1 1900-11-01  9898.33  2343  12  1009.33
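To illustrate the idea of adding a year before parsing, here is a rough, dependency-free sketch that uses the csv module instead of pandas. The year 2016 is a hypothetical value and does not come from the data; in this sketch a space is placed between the year and the day for readability.

```python
import csv
import io
from datetime import datetime

YEAR = "2016"  # hypothetical year, not taken from the data

raw = io.StringIO(
    '"01 DEC",1234.45,2344,11,1212.66\n'
    '"01 NOV", 9898.33, 2343,12,1009.33\n'
)

rows = []
for record in csv.reader(raw):
    # csv.reader already removes the straight double quotes around the date
    date = datetime.strptime(YEAR + " " + record[0], "%Y %d %b")
    # float() tolerates the leading spaces in fields such as " 9898.33"
    values = [float(field) for field in record[1:]]
    rows.append((date, values))

for date, values in rows:
    print(date.date(), values)
```

With pandas the same string concatenation could be done inside the date parsing function, as the answer suggests.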

Thank you both for your input. From your answers I modified the csv file to remove the quotes around the date entry, then things worked fine! I am puzzled because I have used the read_csv method before on similar data that looked like this:
"12/31/2016","The UPS Store","THE UPS STORE 031","10.74","debit","Business Services","Interest Checking","",""
"12/31/2016","Hospice of The East Bay","HOSPICE OF THE EAST","14.00","debit","Clara","Interest Checking","",""
and had no problems – in fact I didn't need to parse the data at all and the reader was able to correctly identify the date. Huh! I guess the real issue was that the date was stored in an unconventional format. In any case, I have the answer and thank you both for your answers.

Related

OCTAVE data import from PCE-VDL data logger device and conversion of decimal comma to decimal point

I have a measurement device, a PCE-VDL, which gives me measurements in the CSV format below, which I need to import into OCTAVE for further investigation.
In particular, I need to import the last 3 columns with the xyz acceleration data.
The file is in CSV format with a semicolon ";" as the delimiter.
I have tried:
A_1 = importdata ("file.csv", ";", 3);
but have received
error: missing_idx(10): out of bound 9
The CSV file looks like this:
#PCE-VDL X - TableView series
#2020.16.11
#Date;Time;Duration [s];t [°C];RH [%];p [mbar];aX [g];aY [g];aZ [g];
2020.28.10;16:16:32:0000;00:000;;;;0,0195;-0,0547;1,0039;
2020.28.10;16:16:32:0052;00:005;;;;0,0898;-0,0273;0,8789;
2020.28.10;16:16:32:0104;00:010;;;;0,0977;-0,0313;0,9336;
2020.28.10;16:16:32:0157;00:015;;;;0,1016;-0,0273;0,9297;
The numbers in the last 3 columns also use a decimal comma rather than a decimal point, so some conversion is probably needed as well.
Thank you very much for any help.
Regards
EDIT: 18.11.2020
Thanks for the help. I have now tried the following:
A_1_str = fileread ("file.csv");
A_1_str_m = strrep (A_1_str, ".", "-");
A_1_str_m = strrep (A_1_str_m, ",", ".");
save "A_1_str_m.csv" A_1_str_m;
A_1 = importdata ("A_1_str_m.csv", ";", 8);
and still receive error: file_content(140): out of bound 139
There is probably some problem with time format in first columns, which I do not want to read. I just need last three columns.
After my conversion, the file looks like this:
# Created by Octave 5.1.0, Wed Nov 18 21:40:52 2020 CET <zdenek@ASUS-F5V>
# name: A_1_str_m
# type: sq_string
# elements: 1
# length: 7849
#PCE-VDL X - TableView series
#2020-16-11
#Date;Time;Duration [s];t [°C];RH [%];p [mbar];aX [g];aY [g];aZ [g];
2020-28-10;16:16:32:0000;00:000;;;;0.0195;-0.0547;1.0039;
2020-28-10;16:16:32:0052;00:005;;;;0.0898;-0.0273;0.8789;
2020-28-10;16:16:32:0104;00:010;;;;0.0977;-0.0313;0.9336;
Thanks for support!
You can first read the data with fileread, which stores the data as a string. Then you can manipulate the string like this:
new_string = strrep(string, ",", ".");
strrep replaces all occurrences of a pattern within a string. Afterwards you save this data as a separate file or you overwrite the existing file with the manipulated data. When this is done you proceed as you have tried before.
EDIT: 19.11.2020
To avoid the additional heading lines in the new file, you can save it like this:
fid = fopen("A_1_str_m.csv", "w");
fputs(fid, A_1_str_m);
fclose(fid);
fputs will just write the string to the file.
Then you can read the new file with dlmread.
A1_buf = dlmread("A_1_str_m.csv", ";");
A1_buf = real(A1_buf); # get the real value of the complex number
A1_buf(1:3, :) = []; # remove the headlines
A1 = A1_buf(:, end-3:end-1); # get only the 3 columns you're looking for
This will give you the three columns you're looking for. But the date and time data will be ignored.
EDIT 20.11.2020
Replaced abs with real, so the sign of the value will be kept.
Use csv2cell from the io package.
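For comparison, the same idea (skip the '#' lines, split on ";", convert the decimal comma only in the last three fields) can be sketched outside Octave. The following is a Python illustration, not Octave code, and the function name read_acceleration is made up. It assumes every data line ends with a trailing ";" as in the sample.

```python
SAMPLE = """#PCE-VDL X - TableView series
#2020.16.11
#Date;Time;Duration [s];t [°C];RH [%];p [mbar];aX [g];aY [g];aZ [g];
2020.28.10;16:16:32:0000;00:000;;;;0,0195;-0,0547;1,0039;
2020.28.10;16:16:32:0052;00:005;;;;0,0898;-0,0273;0,8789;
"""


def read_acceleration(text):
    """Return a list of (aX, aY, aZ) float tuples from the logger export."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the '#' header lines
        fields = line.split(";")
        if fields[-1] == "":
            fields.pop()  # drop the empty field after the trailing ';'
        # Only the last three fields are converted, so the dots in the
        # date column are never touched.
        rows.append(tuple(float(f.replace(",", ".")) for f in fields[-3:]))
    return rows


accel = read_acceleration(SAMPLE)
print(accel)
```

Because the replacement is limited to the acceleration fields, there is no need to rewrite the dots in the date column first.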

Print statement with .format results in error

I have the following print statement:
print("{0:0.2f}% in traing set".format((len(X_train)/(df.index))*100))
where
X_train is 70% of the sample data (training data)
and
df.index is RangeIndex(start=0, stop=768, step=1)
When I run the print statement, I get the following error:
non-empty format string passed to object.__format__
The output of the print statements should be
69.92 in training set
30.08 in test set
I am not able to correct this behavior.
Any help will be appreciated.
Bharat.
I get this error if the object that's supposed to be formatted is a numpy array. RangeIndex() must be producing something similar.
In [82]: "{0:0.2f}% in traing set".format((20/(np.arange(3)))*100)
...
TypeError: non-empty format string passed to object.__format__
In [83]:
Same error if I just give it a list: "{0:0.2f}% in traing set".format([1,2,3])
Formatting works fine if the argument is just a number:
In [83]: "{0:0.2f}% in traing set".format((20/3)*100)
Out[83]: '666.67% in traing set'
The only format specifier that works with an object like a list or array is the plain str/repr,
In [102]: print("{} in traing set".format((len(X_train)/(np.arange(1,3)))*100))
[ 5000. 2500.] in traing set
===============
It's hard to read code in a comment; better to add it as an edit to the original question (clearly marked as such). Here's my guess as to what the formatting is:
value1 = len(X_train)
value2 = len(X_test)
value3 = len(df.index)
value4 = (value1/value3)*100
value5 = (value2/value3)*100
print("{0:0.2f}% in training set".format(value4))
print("{0:0.2f}% in test set".format(value5))
Yes, it is a good idea to separate the calculation of the values from the print formatting.
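Here is a small self-contained sketch of that fix, without pandas. The sizes 537 and 231 are hypothetical stand-ins chosen so that they add up to 768, and a plain range plays the role of df.index.

```python
# Stand-ins for the question's objects (hypothetical sizes: 537 + 231 = 768)
X_train = list(range(537))
X_test = list(range(231))
index = range(768)  # plays the role of df.index

# Dividing by the index itself would not give a single number;
# dividing by its length does.
train_pct = len(X_train) / len(index) * 100
test_pct = len(X_test) / len(index) * 100

train_line = "{0:0.2f}% in training set".format(train_pct)
test_line = "{0:0.2f}% in test set".format(test_pct)
print(train_line)
print(test_line)

# A list cannot take the 0.2f specifier, which mirrors the original error.
try:
    "{0:0.2f}".format([1, 2, 3])
except TypeError as exc:
    print("TypeError:", exc)
```

The key change is len(df.index) in the denominator, so that the value handed to format is a plain float.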

Slow regexprep with a very long string

I have simulation data in an ascii file with a lot of data points. I'm trying to extract variable names and their values from it. The below is an example of what the file format looks like:
*ESA
*COM on Tue Sep 27 15:23:02 2016
*COM C:\Users\vi813c\Documents\My Matlab\
*COM The pathname to the ESB file was: C:\Users\vi813c\Documents\My Matlab
Case013
*RTITLE
Run Date/Time = 20-SEP-2016 13:29:00
MSC.EASY5 time-history plot with 20001 data points
*EOD
*FLOAT
TIME FDLB(1) FSLB(1) FVLB(1) MXLB(1) \
MYLB(1) MZLB(1) FDLB(2) FSLB(2) FVLB(2) \
MXLB(2) MYLB(2) MZLB(2) FDLB(3) FSLB(3) \
FVLB(3) MXLB(3) MYLB(3) MZLB(3)
0 884.439 -0 53645.8 -972.132
-311780 207.866 5403.68 1981.49 327781
258746 -1.74898E+006 84631.4 5384.25 -1308.47
326538 -97028.6 -1.74013E+006 -61858.1
0.002 882.616 0.008033 53661.1 -972.4
-311702 207.779 5400.42 1982.11 327784
258726 -1.74906E+006 84628.3 5381.01 -1308.44
326541 -97040.1 -1.74021E+006 -61858.8
0.004 876.819 0.031336 53705.6 -973.183
-311683 207.661 5391.19 1983.9 327795
258693 -1.74935E+006 84624 5371.85 -1309.63
326552 -97040.6 -1.74051E+006 -61858.8
0.006 869.491 0.061631 53763.3 -974.213
-311806 207.618 5377.45 1986.76 327813
258659 -1.74995E+006 84621.7 5358.2 -1312.04
326569 -97040.3 -1.7411E+006 -61861
0.008 861.718 0.095625 53828.1 -975.379
-312039 207.648 5360.82 1990.12 327834
A summary of data format characteristics is as follows:
Everything above "*FLOAT" is a header and I need to get rid of it
Stuff between "*FLOAT" and the first numeric value are the variable names
The variable names and the values are delimited by space(s) and '\'
The data are "lumped". Each lump has values for the variables at a given simulation time step. In the example above, there are 19 variables so that there are 19 numeric values in each lump
There can be multiple data sets; each preceded with "*FLOAT" and a variable name section
The following is how I am currently handling this data:
0. fileread the file --> one big string of characters
1. regexprep {'\s+','\','\n'} with ',' --> comma delimited for strsplit
2. strfind "*FLOAT"
3. strsplit by ',' --> now becomes a cell
4. find the first numeric value by isnan(str2double(parse))
Then between the index from 2. and the index from 4. are the variable names, and between the index from 4. and the next "*FLOAT" are the numeric data
This scheme is sort of working, but I can't stop thinking that there's gotta be a better way to do this. For one, the step 1. is extremely slow. I guess it's one big string for regexprep to work on with multiple things to replace.
How can I improve my script?
I gave this a shot with the string class which is new in 16b.
str = string(fileread('file.txt'));
fileNewline = [13 newline]; % This data has carriage returns
str = extractAfter(str, ['*FLOAT' fileNewline]);
str = erase(str, ['\' fileNewline]);
str = splitlines(str);
% Get the variable names
varNames = split(str(1))';
% Get the data
data = reshape(str(2:end), 4, [])';
data = strip(data);
data = join(data);
data = split(data);
data = double(data);
I'm not sure about how to load the file faster.
As mentioned in another comment, textscan could probably help. It might end up being the fastest solution. With the correct format specified and using the 'HeaderLines' option, I think you can make it work.
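As a language-neutral illustration of the parsing logic, here is a rough Python sketch (not MATLAB) that avoids regular expressions entirely: it splits on "*FLOAT", treats the backslash as whitespace, and groups the numbers by the count of variable names. The function names are made up. It assumes that no variable name parses as a number and that only names and numbers follow each "*FLOAT".

```python
def is_number(token):
    """True if the token can be read as a float."""
    try:
        float(token)
        return True
    except ValueError:
        return False


def parse_float_blocks(text):
    """Return a list of (names, rows) pairs, one per *FLOAT section."""
    datasets = []
    # Everything before the first *FLOAT is header and is dropped by [1:].
    for section in text.split("*FLOAT")[1:]:
        tokens = section.replace("\\", " ").split()
        # Names run up to the first token that parses as a number.
        first_num = next(
            (i for i, tok in enumerate(tokens) if is_number(tok)), len(tokens)
        )
        names = tokens[:first_num]
        if not names:
            continue  # nothing usable in this section
        values = [float(tok) for tok in tokens[first_num:]]
        width = len(names)
        # One row per "lump"; an incomplete trailing lump is discarded.
        rows = [
            values[i:i + width]
            for i in range(0, len(values) - width + 1, width)
        ]
        datasets.append((names, rows))
    return datasets


sample = (
    "*ESA\n*COM header line\n*FLOAT\n"
    "TIME FDLB(1) FSLB(1) \\\n"
    "FVLB(1)\n"
    "0 884.439 -0 53645.8\n"
    "0.002 882.616\n0.008033 -1.74898E+006\n"
)
names, rows = parse_float_blocks(sample)[0]
print(names)
print(rows)
```

Splitting on whitespace once tends to be much cheaper than running several regular expression replacements over one very long string, which is the step reported as slow.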

MATLAB reading CSV file with timestamp and values

I have the following sample from a CSV file. Structure is:
Date ,Time(Hr:Min:S:mS), Value
2015:08:20,08:20:19:123 , 0.05234
2015:08:20,08:20:19:456 , 0.06234
I then would like to read this into a matrix in MATLAB.
Attempt :
Matrix = csvread('file_name.csv');
Also tried an attempt formatting the string.
fmt = '%u:%u:%u %u:%u:%u:%u %f';
Matrix = csvread('file_name.csv',fmt);
The problem is when the file is read the format is wrong and displays it differently.
Any help or advice given would be greatly appreciated!
EDIT
When using @Adriaan's answer the result is
2015 -11 -9
8 -17 -1
So it seems that MATLAB thinks the '-' is the delimiter(separator)
Matrix = csvread('file_name.csv',1,0);
csvread does not support a format specifier. Just enter the number of header rows (I took it to be one, as per the example) and the number of header columns, 0.
Your file, however, contains non-numeric data. Thus import it with importdata:
data = importdata('file_name.csv')
This will get you a structure, data, with two fields: data.data contains the numeric data, i.e. a vector containing your values. data.textdata is a cell containing the rest of the data; you need the first two columns and must extract the numerics from them, i.e.
for ii = 2:size(data.textdata,1)
tmp1 = data.textdata{ii,1};
Date(ii,1) = datenum(tmp1,'yyyy:mm:dd'); % lowercase mm = month
tmp2 = data.textdata{ii,2};
Date(ii,2) = datenum(tmp2,'HH:MM:SS:FFF'); % uppercase MM = minute
end
Thanks to @Excaza it turns out milliseconds are supported.
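For readers who want to cross-check the parsing outside MATLAB, here is a small Python sketch of the same approach: read the text columns, then convert the date and time strings with an explicit format. It is only an illustration under the assumption that the file looks exactly like the sample in the question.

```python
import csv
import io
from datetime import datetime

raw = io.StringIO(
    "Date ,Time(Hr:Min:S:mS), Value\n"
    "2015:08:20,08:20:19:123 , 0.05234\n"
    "2015:08:20,08:20:19:456 , 0.06234\n"
)

reader = csv.reader(raw)
next(reader)  # skip the header row

samples = []
for date_str, time_str, value_str in reader:
    # %f accepts 1 to 6 digits, so "123" is read as 123 milliseconds
    stamp = datetime.strptime(
        date_str.strip() + " " + time_str.strip(), "%Y:%m:%d %H:%M:%S:%f"
    )
    samples.append((stamp, float(value_str)))

for stamp, value in samples:
    print(stamp.isoformat(), value)
```

Stripping the fields first matters here because the sample has stray spaces before the commas.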

Why is datestr('19-01-2004') = 26-Jun-0024 in MATLAB R2011a?

I also tried the following:
datestr('19-01-2004','dd-mm-yyyy')
ans =
26-06-0024
I am new to MATLAB, so I am not sure what else to check.
In the function datestr(), the 2nd parameter denotes what the output should look like. It doesn't say anything about the input.
Essentially, you try to perform 2 steps: parse a string and then format the parsed date again.
So you can do
n = datenum('19-01-2004','dd-mm-yyyy')
datestr(n, 'yyyy-mm-dd')
and you'll get an n of 731965 and a final output of 2004-01-19.
You can as well do
v = datevec('19-01-2004','dd-mm-yyyy')
datestr(v, 'yyyy-mm-dd')
and your v becomes [2004 1 19 0 0 0].
So remember: step 1 - parsing of input with the appropriate format string, step 2 - formatting of output with the wanted format string.
If you want to give the date in a "clean" and readable format, you could just do
v = [2004 1 19 0 0 0]
datestr(v, 'yyyy-mm-dd')
datestr(v, 'dd.mm.yyyy')
datestr(v, 'mm/dd/yyyy')
When using datestr to convert a date string from one form to another, the format of the input date string is limited to those listed here. The format of your input '19-01-2004' is 'dd-mm-yyyy', which is not one of the supported formats.
If we change the input string to '01/19/2004', which is the supported format 'mm/dd/yyyy', we get the correct output:
>> datestr('01/19/2004','dd-mm-yyyy')
ans =
19-01-2004
To circumvent the limited number of supported input formats, the documentation recommends using datenum first. So you can map your original input onto itself like:
>> datestr(datenum('19-01-2004','dd-mm-yyyy'),'dd-mm-yyyy')
ans =
19-01-2004
Why MATLAB returns the date it does has to do with how it handles the unknown format.
I suspect whatever method it uses to finally decide upon a format results in a really small date number, hence the year 24 output.
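The parse-then-format pattern is not specific to MATLAB. As a cross-check, here is a short Python sketch of the same two steps. The last line relies on MATLAB serial date numbers being 366 days ahead of Python's date ordinals, which reproduces the value 731965 quoted above.

```python
from datetime import datetime

# Step 1: parse the input with a format that describes the input.
parsed = datetime.strptime("19-01-2004", "%d-%m-%Y")

# Step 2: format the parsed date with the wanted output format.
iso = parsed.strftime("%Y-%m-%d")
dotted = parsed.strftime("%d.%m.%Y")

# MATLAB's datenum counts from year 0, Python's ordinal from year 1,
# hence the fixed offset of 366 days.
serial = parsed.toordinal() + 366
print(iso, dotted, serial)
```

Keeping the input format and the output format as two separate strings avoids the confusion described in the question.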