Matlab: Join datasets by not exact but similar values


I have two example datasets, A and B below, that I want to join in Matlab to create C. The keys will be 'product' and 'year', but the problem is that the product number in dataset B only matches the one in A by the first 4 digits. Is there a way to join 'almost' matching numbers in this way?
A
product tariff year
202341 2 1999
202341 4 2000
202341 20 2008
202355 9 1999
202355 16 2000
438811 0 1999
438891 8 1999
438891 3 2001
671212 15 2005
671260 10 2005
and
B
product avg_tariff year
2023 5,5 1999
2023 10 2000
2023 20 2008
4388 4 1999
4388 3 2001
6712 12,5 2005
are joined to produce matrix C
C
product tariff year avg_tariff
202341 2 1999 5,5
202341 4 2000 10
202341 20 2008 20
202355 9 1999 5,5
202355 16 2000 10
438811 0 1999 4
438891 8 1999 4
438891 3 2001 3
671212 15 2005 12,5
671260 10 2005 12,5
Thanks in advance
Oscar

Since this question is related to a previous one of yours I answered, I will reuse the code and update it to the new data:
a.csv
product tariff year
202341 2 1999
202341 4 2000
202341 20 2008
202355 9 1999
202355 16 2000
438811 0 1999
438891 8 1999
438891 3 2001
671212 15 2005
671260 10 2005
b.csv
product avg_tariff year
2023 5.5 1999
2023 10 2000
2023 20 2008
4388 4 1999
4388 3 2001
6712 12.5 2005
MATLAB code
(using the dataset class from the Statistics Toolbox):
%# read A, and build dataset
fid = fopen('a.csv','rt');
C = textscan(fid, '%s%f%f', 'Delimiter',' ', 'MultipleDelimsAsOne',true, 'HeaderLines',1);
fclose(fid);
dA = dataset({C{1} 'product'}, {C{2} 'tariff'}, {C{3} 'year'});
%# read B, and build dataset
fid = fopen('b.csv','rt');
C = textscan(fid, '%s%f%f', 'Delimiter',' ', 'MultipleDelimsAsOne',true, 'HeaderLines',1);
fclose(fid);
dB = dataset({C{1} 'product'}, {C{2} 'avg_tariff'}, {C{3} 'year'});
%# truncate productA
dA.productLong = dA.product;
dA.product = cellfun(@(s)s(:,1:end-2), cellstr(dA.product), 'UniformOutput',false);
%# inner join (keep only rows that exist in both datasets)
ds = join(dA, dB, 'keys',{'product' 'year'}, 'type','inner', 'MergeKeys',true);
%# restore the long product number as first column, and sort by it
ds.product = ds.productLong;
ds.productLong = [];
ds = sortrows(ds, 'product')
The result as expected:
ds =
product tariff year avg_tariff
'202341' 2 1999 5.5
'202341' 4 2000 10
'202341' 20 2008 20
'202355' 9 1999 5.5
'202355' 16 2000 10
'438811' 0 1999 4
'438891' 8 1999 4
'438891' 3 2001 3
'671212' 15 2005 12.5
'671260' 10 2005 12.5
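For readers who want to see the prefix-key idea outside MATLAB, here is a minimal Python sketch. It is only an illustration, not the code from the answer: it assumes the rows are already in memory as tuples, and the function name join_on_prefix and its prefix_len parameter are made up for this example.

```python
def join_on_prefix(rows_a, rows_b, prefix_len=4):
    """Inner-join A to B on (first prefix_len characters of product, year)."""
    # Index B by its (short product, year) key.
    lookup = {(product, year): avg for product, avg, year in rows_b}
    joined = []
    for product, tariff, year in rows_a:
        key = (product[:prefix_len], year)
        if key in lookup:  # inner join: rows of A without a partner in B are dropped
            joined.append((product, tariff, year, lookup[key]))
    return joined


# A few rows taken from the question's datasets (products kept as strings).
rows_a = [("202341", 2, 1999), ("202355", 16, 2000),
          ("438891", 3, 2001), ("671260", 10, 2005)]
rows_b = [("2023", 5.5, 1999), ("2023", 10, 2000),
          ("4388", 3, 2001), ("6712", 12.5, 2005)]

for row in join_on_prefix(rows_a, rows_b):
    print(*row)
```

Keeping the product codes as strings makes the truncation a simple slice, which mirrors what the MATLAB code does with cellfun.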

Load both files with textscan, reading every column as a string:
fidA = fopen('A.txt');
fidB = fopen('B.txt');
A = textscan(fidA,'%s%s%s','delimiter',' ');
B = textscan(fidB,'%s%s%s','delimiter',' ');
fclose(fidA);
fclose(fidB);
Keep only the first 4 characters of product in A, and build a product+year key for each row:
for i = 1:length(A{1})
rowKeyA{i} = [A{1}{i}(1:4),A{3}{i}]; %product(1:4),year
end
for i = 1:length(B{1})
rowKeyB{i} = [B{1}{i},B{3}{i}]; %product,year
end
Now just find matches between rowKeyA and rowKeyB:
for i = 1:length(rowKeyA)
j = find(strcmp(rowKeyB,rowKeyA{i}),1);
if ~isempty(j)
fprintf('%s %s %s\n',rowKeyA{i},A{2}{i},B{2}{j}); %key,tariff,avg_tariff
end
end

Related

Problems with fprintf format (Matlab)

I want to fix the column formatting of a text file (shown at the end) by replacing the spaces with tabs, using the following MATLAB code (the data was imported beforehand):
id = fopen('datoscorfecha.txt', 'w');
fprintf(id, '%5s %3s %3s %3s %4s %3s %6s\n',...
'fecha', 'dia','mes', 'ano', 'hora', 'min', 'abs370');
datos = cat(2,dia, mes, ano, hora, min1, abs370);
datos = datos';
fecha = Fecha'; % Imported as a string
fprintf(id, '%16s %2i %2i %4i %2i %2i %8.4f\n',...
fecha, datos);
fclose(id);
type datoscorfecha.txt
But I get this error:
Error using fprintf
Unable to convert 'string' value to
'int64'.
Fecha dia mes ano hora min abs370
03/06/2016 00:00 3 6 2016 0 0 29.356218
03/06/2016 00:05 3 6 2016 0 5 30.45703
03/06/2016 00:10 3 6 2016 0 10 27.53877
03/06/2016 00:15 3 6 2016 0 15 23.19832
03/06/2016 00:20 3 6 2016 0 20 22.333924
03/06/2016 00:25 3 6 2016 0 25 22.086426
03/06/2016 00:30 3 6 2016 0 30 20.933898

Maybe something like this will let you replace the spaces with tabs. Here I read the text file with the textscan() function, separate the columns, and parse each value as a string. The writematrix() function then writes the data to a new text file with the delimiter set to tab.
Text.txt (Input)
Fecha dia mes ano hora min abs370
03/06/2016 00:00 3 6 2016 0 0 29.356218
03/06/2016 00:05 3 6 2016 0 5 30.45703
03/06/2016 00:10 3 6 2016 0 10 27.53877
03/06/2016 00:15 3 6 2016 0 15 23.19832
03/06/2016 00:20 3 6 2016 0 20 22.333924
03/06/2016 00:25 3 6 2016 0 25 22.086426
03/06/2016 00:30 3 6 2016 0 30 20.933898
datoscorfecha.txt (Output)
Fecha dia mes ano hora min abs370
03/06/2016 00:00 3 6 2016 0 0 29.3562
03/06/2016 00:05 3 6 2016 0 5 30.4570
03/06/2016 00:10 3 6 2016 0 10 27.5388
03/06/2016 00:15 3 6 2016 0 15 23.1983
03/06/2016 00:20 3 6 2016 0 20 22.3339
03/06/2016 00:25 3 6 2016 0 25 22.0864
03/06/2016 00:30 3 6 2016 0 30 20.9339
Full Script:
File_ID = fopen("Text.txt");
Data = textscan(File_ID, '%s %s %s %s %s %s %s %s', 'Delimiter',' ');
fclose(File_ID);
% Data = readtable("Text.txt");
Column_1 = string(Data{:,1});
Column_2 = string(Data{:,2});
Column_3 = string(Data{:,3});
Column_4 = string(Data{:,4});
Column_5 = string(Data{:,5});
Column_6 = string(Data{:,6});
Column_7 = string(Data{:,7});
Column_8 = string(Data{:,8});
for Index = 2: length(Column_8)
Number = str2double(char(Column_8(Index,1)));
Number = num2str(Number);
Decimal_String = split(Number,".");
Decimal_String = Decimal_String{2};
if length(Decimal_String) ~= 4
Number = string(Number) + "0";
end
Column_8(Index,1) = Number;
end
Table = [Column_1 Column_2 Column_3 Column_4 Column_5 Column_6 Column_7 Column_8];
writematrix(Table,"datoscorfecha.txt",'Delimiter','tab');
type datoscorfecha.txt
Ran using MATLAB R2019b
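The same transformation (split on whitespace, fix the last column to four decimals, join with tabs) can be sketched in a few lines of Python. This is a hedged illustration rather than a translation of the script above: the function name retab and its decimals parameter are made up for this example, and it assumes the first line is a header and the last field of every other line is numeric.

```python
def retab(lines, decimals=4):
    """Turn whitespace-separated rows into tab-separated rows.

    The first line is treated as a header; in every other line the last
    field is reformatted with a fixed number of decimals.
    """
    out = []
    for i, line in enumerate(lines):
        fields = line.split()
        if not fields:  # skip empty lines
            continue
        if i > 0:
            fields[-1] = f"{float(fields[-1]):.{decimals}f}"
        out.append("\t".join(fields))
    return out


sample = [
    "Fecha dia mes ano hora min abs370",
    "03/06/2016 00:00 3 6 2016 0 0 29.356218",
    "03/06/2016 00:05 3 6 2016 0 5 30.45703",
]
print("\n".join(retab(sample)))
```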

Pivot table with multiple keyed columns

I have the following table:
t:(([]y:2001 2002) cross ([]m:5 6 7) cross ([]sector:`running`hiking`swimming`cycling)),'([]sales: 14 12 5 9 4 894 1 4 87 12 24 6 4 8 64 354 3 4 86 43 1053 2 43 4);
y m sector sales
------------------------
2001 5 running 14
2001 5 hiking 12
2001 5 swimming 5
2001 5 cycling 9
2001 6 running 4
2001 6 hiking 894
2001 6 swimming 1
2001 6 cycling 4
...
2002 5 running 4
2002 5 hiking 8
2002 5 swimming 64
2002 5 cycling 354
2002 6 running 3
...
I want to pivot the sales values by sector, while keeping the first two y and m columns, such that the resulting table would look like this:
y m cycling hiking running swimming
--------------------------------------
2001 5 9 12 14 5
2001 6 4 894 4 1
2001 7 6 12 87 24
2002 5 354 8 4 64
2002 6 43 4 3 86
2002 7 4 2 1053 43

As per
https://code.kx.com/v2/kb/pivoting-tables/
q) P:asc exec distinct sector from t;
q) exec P#(sector!sales) by y:y,m:m from t
You can unkey the result by () xkey if you need a normal table.
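To make the shape of the pivot explicit, here is a small Python sketch of the same operation on a few of the rows from the question. It is only an illustration of the idea (group by y and m, one column per sector), not a description of how q evaluates the expression; the function name pivot is made up for this example.

```python
def pivot(rows):
    """Pivot (y, m, sector, sales) rows into one dict per (y, m) key."""
    # Sorted distinct sectors play the role of P in the q code.
    sectors = sorted({sector for _, _, sector, _ in rows})
    table = {}
    for y, m, sector, sales in rows:
        # Each (y, m) group starts with every sector set to None.
        table.setdefault((y, m), dict.fromkeys(sectors))[sector] = sales
    return sectors, table


rows = [
    (2001, 5, "running", 14), (2001, 5, "hiking", 12),
    (2001, 5, "swimming", 5), (2001, 5, "cycling", 9),
    (2001, 6, "running", 4), (2001, 6, "hiking", 894),
    (2001, 6, "swimming", 1), (2001, 6, "cycling", 4),
]
sectors, table = pivot(rows)
print("y", "m", *sectors)
for (y, m), sales in table.items():
    print(y, m, *(sales[s] for s in sectors))
```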

Perl function localtime giving incorrect values for years between 1964 and 1967

I was getting some wacky values from the localtime function in Perl. The following is some code for which I get incorrect values.
In particular, this code is meant to determine the weekday for the first of each year.
#!/usr/bin/perl
use strict 'vars';
use Time::Local;
use POSIX qw(strftime);
mytable();
sub mytable {
print "Year" . " "x4 . "Jan 1st (localtime)" . " "x4 . "Jan 1st (Gauss)\n";
foreach my $year ( 1964 .. 2017 )
{
my $janlocaltime = evalweekday( 1,1,$year);
my $jangauss = gauss($year);
my $diff = $jangauss - $janlocaltime;
printf "%4s%10s%-12s ",$year,"",$janlocaltime;
printf "%12s",$jangauss;
printf " <----- ERROR: off by %2s", $diff if ( $diff != 0 );
print "\n";
}
}
sub evalweekday {
## Using "localtime"
my ($day,$month,$year) = @_;
my $epoch = timelocal(0,0,0, $day,$month-1,$year-1900);
my $weekday = ( localtime($epoch) ) [6];
return $weekday;
}
sub gauss {
## Alternative approach
my ($year) = @_;
my $weekday =
( 1 + 5 * ( ( $year - 1 ) % 4 )
+ 4 * ( ( $year - 1 ) % 100 )
+ 6 * ( ( $year - 1 ) % 400 )
) % 7;
return $weekday;
}
Here is the output which shows the years with incorrect values:
Year Jan 1st (localtime) Jan 1st (Gauss)
1964 2 3 <----- ERROR: off by 1
1965 4 5 <----- ERROR: off by 1
1966 5 6 <----- ERROR: off by 1
1967 6 0 <----- ERROR: off by -6
1968 1 1
1969 3 3
1970 4 4
1971 5 5
1972 6 6
1973 1 1
1974 2 2
1975 3 3
1976 4 4
1977 6 6
1978 0 0
1979 1 1
1980 2 2
1981 4 4
1982 5 5
1983 6 6
1984 0 0
1985 2 2
1986 3 3
1987 4 4
1988 5 5
1989 0 0
1990 1 1
1991 2 2
1992 3 3
1993 5 5
1994 6 6
1995 0 0
1996 1 1
1997 3 3
1998 4 4
1999 5 5
2000 6 6
2001 1 1
2002 2 2
2003 3 3
2004 4 4
2005 6 6
2006 0 0
2007 1 1
2008 2 2
2009 4 4
2010 5 5
2011 6 6
2012 0 0
2013 2 2
2014 3 3
2015 4 4
2016 5 5
2017 0 0
In fact, the errors seem to extend as far back as 1900, but I just haven't verified that they are in fact wrong prior to 1964.
perl --version returns the following:
This is perl 5, version 18, subversion 2 (v5.18.2) built for darwin-thread-multi-2level
(with 2 registered patches, see perl -V for more detail)
Copyright 1987-2013, Larry Wall
Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.
Complete documentation for Perl, including FAQ lists, should be found on
this system using "man perl" or "perldoc perl". If you have access to the
Internet, point your browser at http://www.perl.org/, the Perl Home Page.
I'm not sure whether it's relevant, but my operating system is macOS Sierra Version 10.12.3.
I've read through the documentation, but I don't see anything (or I'm being blind) regarding values returned prior to 1968. I've also tried to do a websearch but am not pulling up anything beyond the typical misunderstandings of array values and the numbering of months and days of the year.
Could someone help me out and explain what I'm getting wrong? Or, if this is an issue with my version of Perl, let me know what I can do to fix it.

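This comes from how Time::Local interprets the year you pass in, not from localtime itself. Have a look at perldoc Time::Local, in particular the notes on how year values are interpreted and #Negative-Epoch-Values
On my Linux box (perl 5.20), your code demonstrates the issue nicely. If you print out the epoch value received, you will see that the epoch returned by timelocal becomes huge instead of more negative. For these years $year-1900 is a value below 100, which timelocal treats as a two-digit year in a rolling century around the current year, so 64 is read as 2064 rather than 1964. Passing the four-digit $year to timelocal avoids this: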
Year Epoch Jan 1st (localtime) Jan 1st (Gauss)
1964 2966342400 2 3 <----- ERROR: off by 1
1965 2997964800 4 5 <----- ERROR: off by 1
1966 3029500800 5 6 <----- ERROR: off by 1
1967 3061036800 6 0 <----- ERROR: off by -6
1968 -63185400 1 1
1969 -31563000 3 3
1970 -27000 4 4
1971 31509000 5 5
1972 63045000 6 6
Why don't you try using DateTime library instead:
use DateTime;
my $dt = DateTime->new(
year => 1966, # Real Year
day => 1, # 1-31
month => 1, # 1-12
hour => 0, # 0-23
second => 0, # 0-59
);
print $dt->dow . "\n";
6
6 = Saturday which matches the Wikipedian view: Jan 1, 1966 (Saturday)
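As an independent cross-check of the Gauss formula from the question, the weekday of 1 January can also be computed with a calendar library and compared year by year. The sketch below does this in Python with the standard datetime module, assuming the proleptic Gregorian calendar that module uses; the helper names jan1_weekday and gauss are chosen for this example, and the weekday numbering follows Perl's (0 = Sunday, 6 = Saturday).

```python
from datetime import date


def jan1_weekday(year):
    """Weekday of 1 January, numbered like Perl's localtime (0 = Sunday)."""
    # isoweekday() is 1 = Monday ... 7 = Sunday, so % 7 maps Sunday to 0.
    return date(year, 1, 1).isoweekday() % 7


def gauss(year):
    """Gauss's formula for the weekday of 1 January, as in the question."""
    return (1
            + 5 * ((year - 1) % 4)
            + 4 * ((year - 1) % 100)
            + 6 * ((year - 1) % 400)) % 7


# Years where the two methods disagree; none are expected.
mismatches = [y for y in range(1900, 2018) if jan1_weekday(y) != gauss(y)]
print(mismatches)
```

If the list is empty, the Gauss column in the question's output is the correct one and the localtime column is off only where timelocal was given an ambiguous year.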

how to get the row in some column in one table

I have a problem with a query in PostgreSQL.
I have one table with the following columns:
Year Month Date Rain Tmax Tmin ID Stat Location
1996 1 1 3 25.4 20 98212 air
1996 1 2 1 25.4 19.6 96112 land
1996 1 3 -9999 24.6 19.2 97110 sea
1996 1 4 1 22 19 98212 air
1996 1 5 -9999 24.4 19 96112 land
1996 1 6 -9999 24.2 18.6 98212 air
1996 1 7 1 24.2 19.4 96112 land
1996 1 8 -9999 24.8 20 97110 sea
1996 1 9 -9999 25 19.6 97110 sea
I want to query the rows of the table and write the output to a text file named after the station (ID Stat-Location).
the expected output :
98212-air.txt
Year Month Date Rain Tmax Tmin
1996 1 1 3 25.4 20
1996 1 4 1 22 19
1996 1 6 -9999 24.2 18.6
what should I do?
I'm using postgresql.
thank you..

This query returns the rows you described; writing them to a text file is something you still need to handle yourself.
SELECT year,month,date, rain, tmax,tmin
FROM yourTable WHERE Location='air' and id_stat='98212';
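The file-writing step the answer leaves open could look roughly like the Python sketch below. It is an assumption-heavy illustration: it supposes the rows have already been fetched from PostgreSQL by some client and arrive as tuples in the column order of the question, and the function name write_station_files is made up for this example.

```python
import os
import tempfile


def write_station_files(rows, out_dir):
    """Write one '<id>-<location>.txt' file per station; return the file names."""
    groups = {}
    for year, month, day, rain, tmax, tmin, station_id, location in rows:
        name = f"{station_id}-{location}.txt"
        groups.setdefault(name, []).append((year, month, day, rain, tmax, tmin))
    for name, station_rows in groups.items():
        with open(os.path.join(out_dir, name), "w") as handle:
            handle.write("Year Month Date Rain Tmax Tmin\n")
            for row in station_rows:
                handle.write(" ".join(str(value) for value in row) + "\n")
    return sorted(groups)


# A few rows from the question; a temporary directory stands in for the target folder.
rows = [
    (1996, 1, 1, 3, 25.4, 20, 98212, "air"),
    (1996, 1, 2, 1, 25.4, 19.6, 96112, "land"),
    (1996, 1, 4, 1, 22, 19, 98212, "air"),
]
out_dir = tempfile.mkdtemp()
print(write_station_files(rows, out_dir))
```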

Plot dates matlab

I've a matrix called datevector containing the year, month, day, hour, minutes, seconds of the timeseries that I would like to plot.
datevector = [...
2009 11 4 11 35 0
2009 11 4 11 36 0
2009 11 4 11 37 0
2009 11 4 11 38 0
2009 11 4 11 39 0
2009 11 4 11 40 0]
To plot my data with respect to this time series I create the array containing the time series
xdate = datenum(datevector);
and then I try to plot my data = [1 2 3 4 5 6]
figure
plot(xdate',data)
datetick('x','yyyy-mm-dd HH:MM:SS')
...well, the figure I get is not the one I expected. I would like the x-axis to have minute resolution, as in datevector. Can you help me?
Thanks!
