Convert chararray to float - type-conversion

I am new to Pig programming. I have a text file that uses a comma (,) as the delimiter. The amount columns, amt_IN and amy_OUT, are of type chararray and hold values such as $830.03 and $1392.54 respectively.
I need these two columns in INR. I tried removing the $ symbol from the string first and then converting the result to float.
The text file, petrol.csv, looks like this:
Z7O 7C2,reliance,$830.03,$1392.54,1067,722,1625
T6Q 0L9,hindustan,$994.57,$11765.97,1039,805,1626
S1J 8B8,Bharat,$881.25,$10345.43,1066,657,1627
I used the following code to remove the $ symbol.
A = LOAD 'project/petrol.txt' USING PigStorage(',') as (distributer_id:chararray,distributer_name:chararray,amt_IN:chararray,amy_OUT:chararray,vol_IN:int,vol_OUT:int,year:int);
DUMP A;
(Z7O 7C2,reliance,$830.03,$1392.54,1067,722,1625)
(T6Q 0L9,hindustan,$994.57,$11765.97,1039,805,1626)
(S1J 8B8,Bharat,$881.25,$10345.43,1066,657,1627)
X = FOREACH A GENERATE distributer_id,distributer_name,SUBSTRING(amt_IN,1),SUBSTRING(amy_OUT,1),vol_IN,vol_OUT,year;
DUMP X;
(Z7O 7C2,reliance,830.03,1392.54,1067,722,1625)
(T6Q 0L9,hindustan,994.57,11765.97,1039,805,1626)
(S1J 8B8,Bharat,881.25,10345.43,1066,657,1627)
I need amt_IN and amy_OUT to be converted to float so that I can convert the amount in USD to INR.
Thank you in advance for the help.

In Pig, you can cast a value by putting (type) in front of the expression. If you are using Pig 0.17, you can also use AS var_name:type.
Since you might not be on Pig 0.17, the first approach is the one most likely to work.
X = FOREACH A GENERATE
distributer_id,
distributer_name,
(float) SUBSTRING(amt_IN,1),
(float) SUBSTRING(amy_OUT,1),
vol_IN,
vol_OUT,
year;

Related

OCTAVE data import from PCE-VDL data logger device and conversion of decimal comma to decimal point

I have a PCE-VDL measurement device, which gives me measurements in the CSV format shown below. I need to import them into Octave for further investigation.
In particular, I need to import the last 3 columns with the xyz acceleration data.
The file is in CSV format with a semicolon (;) as the delimiter.
I have tried:
A_1 = importdata ("file.csv", ";", 3);
but received
error: missing_idx(10): out of bound 9
The CSV file looks like this:
#PCE-VDL X - TableView series
#2020.16.11
#Date;Time;Duration [s];t [°C];RH [%];p [mbar];aX [g];aY [g];aZ [g];
2020.28.10;16:16:32:0000;00:000;;;;0,0195;-0,0547;1,0039;
2020.28.10;16:16:32:0052;00:005;;;;0,0898;-0,0273;0,8789;
2020.28.10;16:16:32:0104;00:010;;;;0,0977;-0,0313;0,9336;
2020.28.10;16:16:32:0157;00:015;;;;0,1016;-0,0273;0,9297;
The numbers in the last 3 columns also use a decimal comma rather than a decimal point, so some conversion is probably needed as well.
Thank you very much for any help.
Regards
EDIT: 18.11.2020
Thanks for help. I have tried now following:
A_1_str = fileread ("file.csv");
A_1_str_m = strrep (A_1_str, ".", "-");
A_1_str_m = strrep (A_1_str_m, ",", ".");
save "A_1_str_m.csv" A_1_str_m;
A_1 = importdata ("A_1_str_m.csv", ";", 8);
and still receive error: file_content(140): out of bound 139
There is probably some problem with time format in first columns, which I do not want to read. I just need last three columns.
After my conversion, the file looks like this:
# Created by Octave 5.1.0, Wed Nov 18 21:40:52 2020 CET <zdenek@ASUS-F5V>
# name: A_1_str_m
# type: sq_string
# elements: 1
# length: 7849
#PCE-VDL X - TableView series
#2020-16-11
#Date;Time;Duration [s];t [°C];RH [%];p [mbar];aX [g];aY [g];aZ [g];
2020-28-10;16:16:32:0000;00:000;;;;0.0195;-0.0547;1.0039;
2020-28-10;16:16:32:0052;00:005;;;;0.0898;-0.0273;0.8789;
2020-28-10;16:16:32:0104;00:010;;;;0.0977;-0.0313;0.9336;
Thanks for support!
You can first read the data with fileread, which stores the data as a string. Then you can manipulate the string like this:
new_string = strrep(string, ",", ".");
strrep replaces all occurrences of a pattern within a string. Afterwards you save this data as a separate file or you overwrite the existing file with the manipulated data. When this is done you proceed as you have tried before.
EDIT: 19.11.2020
To avoid the additional heading lines in the new file, you can save it like this:
fid = fopen("A_1_str_m.csv", "w");
fputs(fid, A_1_str_m);
fclose(fid);
fputs will just write the string to the file.
Then you can read the new file with dlmread.
A1_buf = dlmread("A_1_str_m.csv", ";");
A1_buf = real(A1_buf); # keep the real part of any value parsed as a complex number
A1_buf(1:3, :) = []; # remove the three header lines
A1 = A1_buf(:, end-3:end-1); # keep only the 3 columns you're looking for
This will give you the three columns you're looking for, but the date and time data will be ignored.
EDIT 20.11.2020
Replaced abs with real, so the sign of the value will be kept.
Use csv2cell from the io package.

Trouble importing csv-data within MATLAB

I am trying to read in a CSV file that contains daily data on EUR/USD exchange rates, including dates specifying year, month and day. The problem is that readtable(filename) puts single quotes around all table entries, which prevents me from using the data at all.
Detect import options:
opts = detectImportOptions('EUR_USD Historische Data.csv');
Read in the data:
EUR_USD = readtable('EUR_USD Historische Data.csv');
Extract the dates and convert them to a datetime variable:
dt = EUR_USD(:,1);
dates = datetime(dt,'InputFormat','yyyyMMdd');
% Does not work because of single quotes
I was able to extract the closing prices and make them usable, but I am not sure whether this is an elegant way of doing so:
closing_prices = str2double(table2array(EUR_USD(:,5)));
Ultimately the goal is to make the data workable. I need to compare two columns with datetime-variables and if dates do not match between the two columns I need to remove that entry such that in the end both columns match.
This is the vector with dates:
Dates vector wrong
I need it to look like this:
Dates vector correct
I think all you need to do is remove the ' character in order to read the data into datetime correctly. Look at the following example:
%stringz is the same as dt here: just the string data
T = table;
T.stringz = string(['''string1'''; '''string2'''; '''string3''']);
stringz = T.stringz;
%Run the for loop to remove the ' chars
for i = 1:length(stringz)
strval = char(stringz(i,1));
strval = strval(2:end-1);
strmat(i,1) = string(strval);
end
%Then load data into datetime after this for loop
dates = datetime(strmat,'InputFormat','yyyyMMdd');
strmat is then a 3x1 string array with no ' characters around each string.

MATLAB: extract numerical data from alphanumerical table and save as double

I created a list of names of data files, e.g. abc123.xml, abc456.xml, via
list = dir('folder/*.xml').
MATLAB returns this as a 10x1 struct with 5 fields, where the first one is the name. I extracted the needed data with struct2table, so I now have a 10x1 table. I only need the numerical value as a 10x1 double. How can I get rid of the alphanumerical stuff and change the data type?
I tried regexp (Undefined function 'regexp' for input arguments of type 'table') and strfind (Conversion to double from table is not possible). Couldn't come up with anything else, as I'm very new to Matlab.
You can extract the name fields and place them in a cell array, use regexp to capture the first string of digits it finds in each name, then use str2double to convert those to numeric values:
strs = regexp({list.name}, '(\d+)', 'once', 'tokens');
nums = str2double([strs{:}]);

MATLAB reading CSV file with timestamp and values

I have the following sample from a CSV file. Structure is:
Date ,Time(Hr:Min:S:mS), Value
2015:08:20,08:20:19:123 , 0.05234
2015:08:20,08:20:19:456 , 0.06234
I then would like to read this into a matrix in MATLAB.
Attempt :
Matrix = csvread('file_name.csv');
I also tried passing a format string:
fmt = '%u:%u:%u %u:%u:%u:%u %f';
Matrix = csvread('file_name.csv',fmt);
The problem is when the file is read the format is wrong and displays it differently.
Any help or advice given would be greatly appreciated!
EDIT
When using @Adriaan's answer the result is
2015 -11 -9
8 -17 -1
So it seems that MATLAB thinks the '-' is the delimiter(separator)
Matrix = csvread('file_name.csv',1,0);
csvread does not support a format specifier. Just enter the number of header rows (I took it to be one, as per the example) and the number of header columns, 0.
Your file, however, contains non-numeric data, so import it with importdata:
data = importdata('file_name.csv')
This will get you a structure, data, with two fields: data.data contains the numeric data, i.e. a vector containing your values, and data.textdata is a cell array containing the rest of the data. You need its first two columns and have to extract the numeric values from them, i.e.
for ii = 2:size(data.textdata,1)
tmp1 = data.textdata{ii,1};
Date(ii,1) = datenum(tmp1,'yyyy:mm:dd'); % lowercase: in datenum formats 'mm' is the month, 'MM' is minutes
tmp2 = data.textdata{ii,2};
Date(ii,2) = datenum(tmp2,'HH:MM:SS:FFF');
end
Thanks to @Excaza it turns out milliseconds are supported.

Reading Columns

I am working on a C code to read in three columns of numbers from an input file and then do basic math with the numbers obtained. My input file looks like:
155.4996 38.0078 7.65
93.9968 44.9926 7.68
I am currently trying to separate columns using sscanf. To get this started I am trying to read in the columns and print just the third column to an output file. Below is what I have right now:
FILE * fp;
FILE * fp2;
char *string;
char out[2000];
char read[1000];
int column1, column2, column3;
strcpy(read, "casecent");
strcpy(out, "Diff");
fp = fopen(read, "r");
fp2 = fopen(out, "w+");
while (!feof(fp))
{
fgets(string, 1000, fp);
sscanf(string, "%d %d %d", &column1, &column2, &column3);
fprintf(fp2,"%d\n", column3);
}
I am currently getting zeros in the output file instead of numbers. I'm sure I'm just missing something small and dumb but if you could help me out it would be much appreciated.
Use float or double for the column variables' data types. Then use %f or %lf respectively in the format string for sscanf, depending on which data type you chose.
If you want to store or print the values as integers, you'll still have to read them as floats or doubles first, then convert.
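A minimal sketch of that change, assuming the same three-column layout as in the question. The helper names parse_row and copy_third_column are made up for illustration. Besides switching to double and %lf, the sketch reads into a real char array instead of an uninitialized pointer and loops on the result of fgets rather than on feof; those two changes go beyond the answer above but are probably worth making as well.

```c
#include <stdio.h>

/* Hypothetical helper: parse one line holding three whitespace-separated
   numbers. Returns 1 if all three were read, 0 otherwise. */
int parse_row(const char *line, double *c1, double *c2, double *c3)
{
    /* %lf matches double; %d would stop at the decimal point. */
    return sscanf(line, "%lf %lf %lf", c1, c2, c3) == 3;
}

/* Hypothetical helper: write the third column of `in` to `out`,
   one value per line. Returns the number of rows written. */
int copy_third_column(FILE *in, FILE *out)
{
    char line[1000];      /* actual storage for fgets to fill */
    double c1, c2, c3;
    int rows = 0;

    /* Loop on fgets itself instead of testing feof before reading. */
    while (fgets(line, sizeof line, in) != NULL) {
        if (parse_row(line, &c1, &c2, &c3)) {
            fprintf(out, "%.2f\n", c3);
            rows++;
        }
    }
    return rows;
}
```

With the files from the question, you would open casecent for reading and Diff for writing as before, check that both fopen calls succeeded, and then call copy_third_column(fp, fp2). Lines that do not contain three numbers are skipped rather than printed as stale values.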