Appending datasets by matched variables - append

I have to append three datasets named A, B and C that contain data for various years (for example, 1990, 1991...2014).
The problem is that not all datasets contain all the survey years and therefore the unmatched years need to be dropped manually before appending.
I would like to know if there is any way to append three (or more) datasets that will keep only the matched variables across the datasets (years in this case).

Consider the following toy example:
clear
input year var
1995 0
1996 1
1997 2
1998 3
1999 4
2000 5
end
save data1, replace
clear
input year var
1995 6
1996 9
1998 7
1999 8
2000 9
end
save data2, replace
clear
input year var
1995 10
1996 11
1997 12
2000 13
end
save data3, replace
There is no option that will force append to do what you want, but you can do the following:
use data1, clear
append using data2 data3
duplicates tag year, generate(tag)
sort year
list
     +------------------+
     | year   var   tag |
     |------------------|
  1. | 1995     0     2 |
  2. | 1995     6     2 |
  3. | 1995    10     2 |
  4. | 1996     9     2 |
  5. | 1996     1     2 |
     |------------------|
  6. | 1996    11     2 |
  7. | 1997     2     1 |
  8. | 1997    12     1 |
  9. | 1998     7     1 |
 10. | 1998     3     1 |
     |------------------|
 11. | 1999     8     1 |
 12. | 1999     4     1 |
 13. | 2000    13     2 |
 14. | 2000     5     2 |
 15. | 2000     9     2 |
     +------------------+
drop if tag == 1
list
     +------------------+
     | year   var   tag |
     |------------------|
  1. | 1995     0     2 |
  2. | 1995     6     2 |
  3. | 1995    10     2 |
  4. | 1996     9     2 |
  5. | 1996     1     2 |
     |------------------|
  6. | 1996    11     2 |
  7. | 2000    13     2 |
  8. | 2000     5     2 |
  9. | 2000     9     2 |
     +------------------+
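You can generalize this approach by finding the maximum value of the variable tag and keeping all observations with that value. This works as long as each year appears at most once per dataset and at least one year is present in every dataset: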
summarize tag
keep if tag == `r(max)'
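For anyone who wants to check the logic outside Stata, here is a minimal Python sketch of the same idea, using the toy data from the question. It keeps only the years present in every dataset by intersecting the sets of years. The function name append_matched is hypothetical, and the sketch assumes each year appears at most once per dataset.

```python
# Toy datasets from the question as (year, var) pairs.
data1 = [(1995, 0), (1996, 1), (1997, 2), (1998, 3), (1999, 4), (2000, 5)]
data2 = [(1995, 6), (1996, 9), (1998, 7), (1999, 8), (2000, 9)]
data3 = [(1995, 10), (1996, 11), (1997, 12), (2000, 13)]


def append_matched(*datasets):
    """Append datasets, keeping only years that occur in all of them.

    Hypothetical helper for illustration; not a Stata feature.
    """
    # Years common to every dataset.
    common = set.intersection(*({year for year, _ in d} for d in datasets))
    # Stack the rows and drop the unmatched years.
    rows = [row for d in datasets for row in d if row[0] in common]
    # Stable sort by year, so rows keep dataset order within a year.
    return sorted(rows, key=lambda row: row[0])


result = append_matched(data1, data2, data3)
for year, var in result:
    print(year, var)
```

Unlike the tag == r(max) shortcut, the intersection returns nothing when no year is shared by all datasets, rather than silently keeping the most frequent years.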

Preserving blank columns & adding delimiters when reading fixed width data

I am parsing through a file.
The file format is like this:
Column1  Column2  Column3  Column4  Column5
1        2        3        4        5
6        7                 8        9
10       11       12                14
         15       16       17       18
Some of the columns are empty. I am reading two files that have the same format as above, merging them, and adding a "|" between each column, so the result should look like this:
Column1 | Column2 | Column3 | Column4 | Column5
1       | 2       | 3       | 4       | 5
6       | 7       |         | 8       | 9
10      | 11      | 12      |         | 14
        | 15      | 16      | 17      | 18
But this is what I get instead. The empty columns are dropped:
Column1 | Column2 | Column3 | Column4 | Column5
1 | 2 | 3 | 4 | 5
6 | 7 | 8 | 9
10 | 11 | 12 | 14
15 | 16 | 17 | 18
Code part:
while (<FH>) {
    my @lines = split ' ', $_;
    say join '|', @lines;
}
I know this is happening because I am splitting with space delimiter. Can anyone tell me how to get the desired output?
You can use unpack to parse fixed-width data. The A9 in the template assumes your columns are 9 characters wide. You can use sprintf to space the data out again into columns of the original width.
use warnings;
use strict;
while (<DATA>) {
    chomp;
    printf "%s\n", join '| ', map { sprintf '%-8s', $_ } unpack 'A9' x 5, $_;
}
__DATA__
Column1  Column2  Column3  Column4  Column5
1        2        3        4        5
6        7                 8        9
10       11       12                14
         15       16       17       18
This prints:
Column1 | Column2 | Column3 | Column4 | Column5
1       | 2       | 3       | 4       | 5
6       | 7       |         | 8       | 9
10      | 11      | 12      |         | 14
        | 15      | 16      | 17      | 18
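For comparison, the same unpack/sprintf idea can be sketched in Python with plain string slicing. This is only an illustration under the same assumption as the A9 template (5 columns, each 9 characters wide); the names WIDTH, COLUMNS and reformat are hypothetical.

```python
WIDTH = 9     # assumed column width, matching the A9 template
COLUMNS = 5   # assumed number of columns


def reformat(line):
    """Slice a fixed-width line into fields and rejoin them with '| '."""
    # Pad short lines so every slice exists, even for trailing empty columns.
    line = line.rstrip("\n").ljust(WIDTH * COLUMNS)
    fields = [line[i:i + WIDTH].rstrip()
              for i in range(0, WIDTH * COLUMNS, WIDTH)]
    # Left-justify each field to 8 characters, like sprintf '%-8s'.
    return "| ".join(f"{field:<8}" for field in fields).rstrip()


# Build sample rows programmatically so the fixed widths are exact.
rows = [
    ["Column1", "Column2", "Column3", "Column4", "Column5"],
    ["6", "7", "", "8", "9"],
    ["", "15", "16", "17", "18"],
]
for cells in rows:
    print(reformat("".join(cell.ljust(WIDTH) for cell in cells)))
```

Slicing by position, rather than splitting on whitespace, is what preserves the empty columns.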
If you only need to reformat the data rather than parse it, you can use a regex substitution to add the vertical bar characters.
The code below adds | after every 9 characters, which assumes fixed-width columns. The \K assertion means to keep (get it?) all of the text matched to its left rather than replace it, so in effect it marks the point where the replacement text from the right side of the s/// is inserted. The /m option makes $ match at the end of every line instead of only at the end of the string. The (?!$) assertion means "not at the end of the line", so nothing is inserted after the final column.
I did it with all of the text in a single variable, but you could do it line by line.
If the columns are variable width, you can still do it with a regex, but it gets more complicated; unpack/sprintf may well be simpler in that case.
$s = '
Column1  Column2  Column3  Column4  Column5
1        2        3        4        5
6        7                 8        9
10       11       12                14
         15       16       17       18
';
$s =~ s/.{9}(?!$)\K/| /gm;
print $s;
This prints:
Column1  | Column2  | Column3  | Column4  | Column5
1        | 2        | 3        | 4        | 5
6        | 7        |          | 8        | 9
10       | 11       | 12       |          | 14
         | 15       | 16       | 17       | 18
For more information, see perlre.
Thanks.
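The substitution approach carries over to other regex engines too. Below is a rough Python equivalent, offered as a sketch rather than a drop-in replacement: Python's re module has no \K, so the 9 characters are captured and written back with \1 instead. The names WIDTH and add_bars are hypothetical, and the 9-character width is the same assumption as above.

```python
import re

WIDTH = 9  # assumed fixed column width


def add_bars(text):
    """Insert '| ' after every WIDTH characters, except at the end of a line."""
    # (?!$) with re.MULTILINE means "not at the end of a line".
    pattern = r"(.{%d})(?!$)" % WIDTH
    return re.sub(pattern, r"\1| ", text, flags=re.MULTILINE)


# Build the sample rows programmatically so the widths are exact.
rows = [
    ["Column1", "Column2", "Column3", "Column4", "Column5"],
    ["10", "11", "12", "", "14"],
]
text = "\n".join("".join(c.ljust(WIDTH) for c in cells).rstrip() for cells in rows)
print(add_bars(text))
```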

I need to pivot the table and get the count for every year present for the respective ids

I need to retrieve the data year-wise, that is, count(values) per year for each id, with one column per year in the result. This is what I have tried:
select *
from crosstab(select id, year, count(values) from table)
as res(id int, year1 int, year2 int, year3 int, year4 int)
Data in Table
============================
Id | Year | values
============================
1 | 2015-01-10 | 2
1 | 2015-02-11 | 3
1 | 2016-03-11 | 5
1 | 2017-05-07 | 3
1 | 2014-01-01 | 1
2 | 2014-01-10 | 7
2 | 2015-03-03 | 9
2 | 2016-08-08 | 8
2 | 2017-09-09 | 5
Expected result
===============================
id | 2014 | 2015 | 2016 | 2017
===============================
1 | 1 | 5 | 5 | 3
2 | 7 | 9 | 8 | 5
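To make the target shape concrete, here is a small Python sketch of the pivot, using the rows from the question. Note that the figures in the result table match the sum of values per id and year (for example 2 + 3 = 5 for id 1 in 2015), not a count, so the sketch sums. The name pivot_by_year is hypothetical, and treating a missing id/year combination as 0 is an assumption.

```python
from collections import defaultdict

# Rows from the question: (id, date, values).
ROWS = [
    (1, "2015-01-10", 2), (1, "2015-02-11", 3), (1, "2016-03-11", 5),
    (1, "2017-05-07", 3), (1, "2014-01-01", 1), (2, "2014-01-10", 7),
    (2, "2015-03-03", 9), (2, "2016-08-08", 8), (2, "2017-09-09", 5),
]


def pivot_by_year(rows):
    """Sum values per (id, year) and lay the years out as columns."""
    totals = defaultdict(lambda: defaultdict(int))
    for row_id, date, value in rows:
        totals[row_id][int(date[:4])] += value  # year is the first 4 characters
    years = sorted({year for per_id in totals.values() for year in per_id})
    # Assumption: an id with no rows in a given year gets 0.
    table = {row_id: [per_id.get(year, 0) for year in years]
             for row_id, per_id in totals.items()}
    return years, table


years, table = pivot_by_year(ROWS)
print("id", *years)
for row_id, cells in sorted(table.items()):
    print(row_id, *cells)
```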

Architecture Design for Bus Routing with Time

I would like to confirm whether my design is good enough, or get better ideas for solving the bus routing problem with time. My solution consists of the following primary steps:
Have one edges table that represents all the edges (source and target are the vertices, i.e. bus stops):
postgres=# select id, source, target, cost from busedges;
id | source | target | cost
----+--------+--------+------
1 | 1 | 2 | 1
2 | 2 | 3 | 1
3 | 3 | 4 | 1
4 | 4 | 5 | 1
5 | 1 | 7 | 1
6 | 7 | 8 | 1
7 | 1 | 6 | 1
8 | 6 | 8 | 1
9 | 9 | 10 | 1
10 | 10 | 11 | 1
11 | 11 | 12 | 1
12 | 12 | 13 | 1
13 | 9 | 15 | 1
14 | 15 | 16 | 1
15 | 9 | 14 | 1
16 | 14 | 16 | 1
Have a table which represents bus details like from time, to time, edge etc.
NOTE: I have used integer format for "from" and "to" column for faster results as I can do an integer query, but I can replace it with any better format if available.
postgres=# select id, "busedgeId", "busId", "from", "to" from busedgetimes;
id | busedgeId | busId | from | to
----+-----------+-------+-------+-------
18 | 1 | 1 | 33000 | 33300
19 | 2 | 1 | 33300 | 33600
20 | 3 | 2 | 33900 | 34200
21 | 4 | 2 | 34200 | 34800
22 | 1 | 3 | 36000 | 36300
23 | 2 | 3 | 36600 | 37200
24 | 3 | 4 | 38400 | 38700
25 | 4 | 4 | 38700 | 39540
Use Dijkstra's algorithm to find the shortest path.
Get the upcoming buses from the busedgetimes table, earliest first, for the path found by Dijkstra's algorithm. This leads to a somewhat complex query, though.
Can I do any kind of improvements to this, or are there any better designs?
Links to docs, articles related to this would be really helpful.
This is totally normal and the regular way to do it. See also: PgRouting Example
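The last two steps from the question can be sketched in Python as follows. This is a minimal illustration, not pgRouting's API: it runs a plain Dijkstra over part of the busedges table, then walks the resulting path and picks, for each edge, the earliest trip in busedgetimes that leaves at or after the current time. The function names are hypothetical, and treating edges as directed and "from"/"to" as comparable integers on one clock are assumptions.

```python
import heapq

# (id, source, target, cost) rows from busedges (first component only).
EDGES = [(1, 1, 2, 1), (2, 2, 3, 1), (3, 3, 4, 1), (4, 4, 5, 1),
         (5, 1, 7, 1), (6, 7, 8, 1), (7, 1, 6, 1), (8, 6, 8, 1)]
# (busedgeId, busId, from, to) rows from busedgetimes.
TIMES = [(1, 1, 33000, 33300), (2, 1, 33300, 33600), (3, 2, 33900, 34200),
         (4, 2, 34200, 34800), (1, 3, 36000, 36300), (2, 3, 36600, 37200),
         (3, 4, 38400, 38700), (4, 4, 38700, 39540)]


def shortest_path(edges, start, goal):
    """Plain Dijkstra; returns the edge ids of a cheapest path, or None."""
    graph = {}
    for edge_id, source, target, cost in edges:  # assumed directed
        graph.setdefault(source, []).append((target, cost, edge_id))
    best = {start: 0}
    heap = [(0, start, [])]
    while heap:
        dist, node, path = heapq.heappop(heap)
        if node == goal:
            return path
        if dist > best.get(node, float("inf")):
            continue  # stale heap entry
        for target, cost, edge_id in graph.get(node, []):
            new_dist = dist + cost
            if new_dist < best.get(target, float("inf")):
                best[target] = new_dist
                heapq.heappush(heap, (new_dist, target, path + [edge_id]))
    return None


def upcoming_buses(path, times, depart_after):
    """Pick, per edge on the path, the earliest trip leaving at or after 'now'."""
    itinerary, now = [], depart_after
    for edge_id in path:
        candidates = [t for t in times if t[0] == edge_id and t[2] >= now]
        if not candidates:
            return None  # no connection left for this edge
        trip = min(candidates, key=lambda t: t[2])
        itinerary.append(trip)
        now = trip[3]  # arrival time becomes the earliest next departure
    return itinerary


path = shortest_path(EDGES, 1, 5)
print(path)
print(upcoming_buses(path, TIMES, 33000))
```

One design caveat: because the path is fixed before the timetable is consulted, a route that is cheaper by cost but has worse connections will still be chosen. Whether that matters depends on what cost is meant to represent.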

Tableau month-based bar chart for data with date range

I have data similar to the below:
id | start | end | name
1 | 2017-01-15 | 2017-03-30 | Item 1
2 | 2017-02-01 | 2017-05-15 | Item 2
3 | 2017-02-15 | 2017-04-01 | Item 3
I want to represent this as a bar chart with Month on the horizontal axis, and count on the vertical axis, where the value is computed by how many items fall within that month. In the above data set, January would have a value of 1, February would have a value of 3, March would have a value of 3, April would have a value of 2, and May would have a value of 1.
The closest I can get right now is to represent the count of items with the start or end date, but I want the month to represent how many items fall within that month.
I haven't found a way to do this in Tableau without restructuring my data set to have each current row restated for each month, which I don't have the luxury to do. Is this possible at all?
One solution could be to have 12 calculated fields like below
id | start | end | name | Jan | Feb | Mar | Apr | May...
1 | 2017-01-15 | 2017-03-30 | Item 1 | 1 | 1 | 1 | 0 | 0
2 | 2017-02-01 | 2017-05-15 | Item 2 | 0 | 1 | 1 | 1 | 1
3 | 2017-02-15 | 2017-04-01 | Item 3 | 0 | 1 | 1 | 1 | 0
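Definition of the calculated fields (MONTH returns the month number, whereas DATENAME returns the month name as a string, which cannot be compared with a number):
'Jan' is IF MONTH([start]) <= 1 AND 1 <= MONTH([end]) THEN 1 ELSE 0 END
'Feb' is IF MONTH([start]) <= 2 AND 2 <= MONTH([end]) THEN 1 ELSE 0 END
and so on. This assumes each item starts and ends within the same calendar year.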
Then using Pivot option in Tableau, convert it to something like
name | Month | Count
Item1 | Jan | 1
Item2 | Jan | 0
Item3 | Jan | 0
...
Item1 | Feb | 1
Item2 | Feb | 1
Item3 | Feb | 1
...
Item1 | Mar | 1
Item2 | Mar | 1
Item3 | Mar | 1
...
Drag Month to 'Columns' and SUM(Count) to 'Rows' to generate the final visualization.
Hope this helps!
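As a quick sanity check of the counts the chart should show, here is a small Python sketch that expands each item's date range into the months it covers and counts items per month. It is only meant to verify the expected numbers and is not something Tableau runs; the names months_covered and counts are hypothetical.

```python
from collections import Counter
from datetime import date

# (id, start, end, name) rows from the question.
ITEMS = [
    (1, date(2017, 1, 15), date(2017, 3, 30), "Item 1"),
    (2, date(2017, 2, 1), date(2017, 5, 15), "Item 2"),
    (3, date(2017, 2, 15), date(2017, 4, 1), "Item 3"),
]


def months_covered(start, end):
    """Yield every (year, month) touched by the range start..end, inclusive."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield (year, month)
        month += 1
        if month > 12:  # roll over into the next year
            year, month = year + 1, 1


counts = Counter(
    month
    for _id, start, end, _name in ITEMS
    for month in months_covered(start, end)
)
for (year, month), n in sorted(counts.items()):
    print(f"{year}-{month:02d}: {n}")
```

Because it compares (year, month) pairs, this version also handles ranges that cross a year boundary, which the month-number comparison in the calculated fields does not.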

Is it possible to do mathematical operations on values in the same column but different rows?

Say I have this table,
year | name | score
------+---------------+----------
2017 | BRAD | 5
2017 | BOB | 5
2016 | JON | 6
2016 | GUYTA | 2
2015 | PAC | 2
2015 | ZAC | 0
How would I go about averaging the scores by year and then getting the difference between years?
year | increase
------+-----------
2017 | 1
2016 | 3
You should use a window function, lead() in this case:
select year, avg, (avg - lead(avg) over w)::int as increase
from (
select year, avg(score)::int
from my_table
group by 1
) s
window w as (order by year desc);
year | avg | increase
------+-----+----------
2017 | 5 | 1
2016 | 4 | 3
2015 | 1 |
(3 rows)
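To see what lead() is doing here, the same computation can be sketched in Python: average the scores per year, order the years descending, and subtract the next row's average from the current one. The name yearly_increase is hypothetical. The sketch keeps the averages as floats instead of casting to int, which makes no difference for this data because the averages are whole numbers.

```python
from collections import defaultdict

# (year, name, score) rows from the question.
ROWS = [
    (2017, "BRAD", 5), (2017, "BOB", 5),
    (2016, "JON", 6), (2016, "GUYTA", 2),
    (2015, "PAC", 2), (2015, "ZAC", 0),
]


def yearly_increase(rows):
    """Return (year, avg, increase) tuples, newest year first."""
    scores = defaultdict(list)
    for year, _name, score in rows:
        scores[year].append(score)
    averages = [(year, sum(vals) / len(vals))
                for year, vals in sorted(scores.items(), reverse=True)]
    result = []
    for i, (year, avg) in enumerate(averages):
        # lead(avg): the average of the next row in descending order,
        # i.e. the previous year; None when there is no next row.
        previous = averages[i + 1][1] if i + 1 < len(averages) else None
        result.append((year, avg, None if previous is None else avg - previous))
    return result


increases = yearly_increase(ROWS)
for row in increases:
    print(row)
```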