How to randomly sample one value from a variable

I have a large unbalanced panel dataset, which looks as follows:
clear
input year id income
2003 513 500
2003 517 500
2003 518 100
2003 525 900
2003 528 800
2003 531 0
2003 532 300
2003 534 600
2004 513 1000
2004 517 120
2004 523 300
2004 525 700
2004 528 800
2004 531 200
2004 532 600
2004 534 100
end
I want to randomly sample some people by id. The id range has gaps in the positive natural numbers (minimum 513, maximum 287321), because of panel dropouts (e.g. 514, 515, 516 are missing).
I need to preserve the panel feature of the data: if a random id is chosen, every year-id combination for that id has to be kept. I do not need a random sample of the data (neither 10% nor 10 observations). Rather, I am interested in a random id number from my id column/variable, stored in a way that I can use subsequently.
Thus, I am looking for a command like "pick one random value out of the given set of values in column id". I subsequently want to use this randomly picked id in commands such as:
xtline income if id==X
which is supposed to show me the income for all years of the random person/id X.
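For reference, here is a minimal sketch of literally picking one random id and storing it for later use (not taken from the answers below; it assumes Stata 14+ for runiformint() and that the data can be xtset for xtline):
set seed 2803
* collect the distinct ids actually present, gaps and all
quietly levelsof id, local(ids)
local n : word count `ids'
* draw one of them uniformly at random and store it in a local macro
local pick : word `=runiformint(1, `n')' of `ids'
display "randomly chosen id: `pick'"
xtset id year
xtline income if id == `pick'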

One way to get what you want is this:
clear
input year id var
2003 513 5
2003 517 5
2003 523 6
2003 525 9
2003 528 8
2003 531 0
2003 532 3
2003 534 6
2004 513 10
2004 517 12
2004 523 3
2004 525 7
2004 528 8
2004 531 2
2004 532 6
2004 534 1
end
bysort year (id): sample 3, count
list, sepby(year)
+------------------+
| year id var |
|------------------|
1. | 2003 523 6 |
2. | 2003 534 6 |
3. | 2003 531 0 |
|------------------|
4. | 2004 517 12 |
5. | 2004 523 3 |
6. | 2004 532 6 |
+------------------+
To sample 10% instead, drop the count option:
bysort year (id): sample 10
EDIT:
To randomly select the same ids in every year (that is, to keep whole panels):
set seed 12345
generate random = runiform()
* give all observations of an id the same draw
bysort id: replace random = random[1]
* keep the ids whose draw falls below 10%
keep if random < 0.1
sort year id
list, sepby(year)
+-----------------------------+
| year id var random |
|-----------------------------|
1. | 2003 523 6 .0039323 |
2. | 2003 532 3 .0286627 |
|-----------------------------|
3. | 2004 523 3 .0039323 |
4. | 2004 532 6 .0286627 |
+-----------------------------+

This wasn't well explained, at least at first, but I think you want to select panels randomly. The method below selects first observations randomly and then extends any selection to the entire panel it belongs to. It doesn't take account of the number of observations in any panel. Flagging a selection with -1 is just a minor device so that selected observations sort early. The magic number 5 -- replace it with however many panels you want -- is the number of panels selected (not a percentage), which is what you're asking for.
clear
input float(year id income)
2003 513 500
2004 513 1000
2003 517 500
2004 517 120
2003 518 100
2004 523 300
2003 525 900
2004 525 700
2003 528 800
2004 528 800
2003 531 0
2004 531 200
2003 532 300
2004 532 600
2003 534 600
2004 534 100
end
list, sepby(id)
+---------------------+
| year id income |
|---------------------|
1. | 2003 513 500 |
2. | 2004 513 1000 |
|---------------------|
3. | 2003 517 500 |
4. | 2004 517 120 |
|---------------------|
5. | 2003 518 100 |
|---------------------|
6. | 2004 523 300 |
|---------------------|
7. | 2003 525 900 |
8. | 2004 525 700 |
|---------------------|
9. | 2003 528 800 |
10. | 2004 528 800 |
|---------------------|
11. | 2003 531 0 |
12. | 2004 531 200 |
|---------------------|
13. | 2003 532 300 |
14. | 2004 532 600 |
|---------------------|
15. | 2003 534 600 |
16. | 2004 534 100 |
+---------------------+
* flag the first observation in each panel; the minus sign makes
* flagged observations sort first
bysort id : gen byte first = -(_n == 1)
set seed 1776
gen rnd = runiform()
* put the first observations, in random order, at the top
sort first rnd
* select the first observations of 5 randomly chosen panels
gen wanted = _n <= 5
* extend each selection to the entire panel
bysort id (wanted) : replace wanted = wanted[_N]
sort id year
list id year if wanted, sepby(id)
+------------+
| id year |
|------------|
7. | 525 2003 |
8. | 525 2004 |
|------------|
9. | 528 2003 |
10. | 528 2004 |
|------------|
11. | 531 2003 |
12. | 531 2004 |
|------------|
13. | 532 2003 |
14. | 532 2004 |
|------------|
15. | 534 2003 |
16. | 534 2004 |
+------------+
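As a quick follow-up to tie this back to the original question: with the panels selected, something like the following would plot income over time for just those ids (a sketch; xtline needs the data to be xtset first):
xtset id year
xtline income if wanted, overlay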

Related

Appending datasets by matched variables

I have to append three datasets named A, B and C that contain data for various years (for example, 1990, 1991, ..., 2014).
The problem is that not all datasets contain all the survey years, and therefore the unmatched years need to be dropped manually before appending.
I would like to know if there is any way to append three (or more) datasets while keeping only the values matched across all of them (years in this case).
Consider the following toy example:
clear
input year var
1995 0
1996 1
1997 2
1998 3
1999 4
2000 5
end
save data1, replace
clear
input year var
1995 6
1996 9
1998 7
1999 8
2000 9
end
save data2, replace
clear
input year var
1995 10
1996 11
1997 12
2000 13
end
save data3, replace
There is no option that will force append to do what you want, but you can do the following:
use data1, clear
append using data2 data3
duplicates tag year, generate(tag)
sort year
list
+------------------+
| year var tag |
|------------------|
1. | 1995 0 2 |
2. | 1995 6 2 |
3. | 1995 10 2 |
4. | 1996 9 2 |
5. | 1996 1 2 |
|------------------|
6. | 1996 11 2 |
7. | 1997 2 1 |
8. | 1997 12 1 |
9. | 1998 7 1 |
10. | 1998 3 1 |
|------------------|
11. | 1999 8 1 |
12. | 1999 4 1 |
13. | 2000 13 2 |
14. | 2000 5 2 |
15. | 2000 9 2 |
+------------------+
drop if tag == 1
list
+------------------+
| year var tag |
|------------------|
1. | 1995 0 2 |
2. | 1995 6 2 |
3. | 1995 10 2 |
4. | 1996 9 2 |
5. | 1996 1 2 |
|------------------|
6. | 1996 11 2 |
7. | 2000 13 2 |
8. | 2000 5 2 |
9. | 2000 9 2 |
+------------------+
You can also further generalize this approach by finding the maximum value of the variable tag and keeping all observations with that value (this assumes each dataset contributes at most one observation per year and that at least one year appears in all datasets):
summarize tag
keep if tag == `r(max)'
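For a sketch of an alternative that generalizes to any number of appended datasets without the tag bookkeeping (again assuming each dataset has at most one observation per year): count the observations per year after appending and keep the years that occur in all datasets.
use data1, clear
append using data2 data3
* keep only the years present in all 3 appended datasets
bysort year: keep if _N == 3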

Looking for a data compression implementation in C without dynamic allocation for STM32

I'm looking for a lossless data compression algorithm implementation that can run on an STM32L4. The data are ECG curves (so basically a set of 16-bit numerical values that are relatively close to one another).
I've found different implementations, for example miniz, but they all use dynamic memory allocation (which I want to avoid) and are also pretty complicated and resource-consuming.
I've read this post, but there isn't really an answer. I'd like to avoid modifying an existing implementation to get rid of dynamic allocation, since this functionality (data compression) is not my main priority.
I don't need a fancy state-of-the-art algorithm, but rather a simple, resource-limited algorithm to save some bandwidth when sending my data over the air, even if the compression ratio is not the best.
Do you have any idea of an algorithm that may fit?
I was using https://github.com/pfalcon/uzlib
It uses malloc, but it is very easy to amend to use fixed-size buffers.
Take a look and give it a try.
Based on your data example, you can write your own very simple compression, with no external library; it will be faster and may even achieve a better compression ratio.
If you look at your data, the difference between consecutive numbers often fits in an 8-bit integer (int8_t), which can hold numbers between -128 and +127.
This means you can store the difference between consecutive numbers whenever it lies in the range -127 .. +127.
The number -128 (0x80) can be a magic value meaning that the next number is transmitted as a full 16-bit value. This magic number is also used for synchronization and at the beginning of the stream.
Alternatively, use 4-bit numbers instead of 8-bit ones (a little more complex): the magic number is then -8, the representable range is -7 .. +7, and you store two numbers in one byte. (A C sketch of the 8-bit variant follows the table below.)
So, in case of your example:
input | output 8bit | output 4bit
int16 | int8 int16 | int4 int16
---------+---------------+---------------
-12 | -128 -12 | -8 -12
-12 | 0 | 0
-12 | 0 | 0
-11 | 1 | 1
-15 | -4 | -4
-8 | 7 | 7
-16 | -8 | -8 -16
-29 | -13 | -8 -29
28 | 57 | -8 28
169 | -128 169 | -8 141
327 | -128 327 | -8 158
217 | -110 | -8 217
-79 | -128 -79 | -8 -79
-91 | -12 | -8
-59 | 32 | -8 -59
-41 | 18 | -8 -41
-36 | 5 | 5
-29 | 7 | 7
-26 | 3 | 3
-24 | 2 | 2
-22 | 2 | 2
-19 | 3 | 3
-14 | 5 | 5
-14 | 0 | 0
-12 | 2 | 2
-10 | 2 | 2
-10 | 0 | 0
-5 | 5 | 5
-2 | 3 | 3
1 | 3 | 3
5 | 4 | 4
10 | 5 | 5
15 | 5 | 5
17 | 2 | 2
21 | 4 | 4
22 | 1 | 1
20 | -2 | -2
20 | 0 | 0
15 | -5 | -5
9 | -6 | -6
2 | -7 | -7
-6 | -8 | -8 -6
---------+---------------+---------------------
42 | 42 4 | 42 11 count
84 | 42 8 | 21 22 bytes
84 | 50 | 43 bytes total
100% | 60% | 51% compression ratio
So, as you can see, a very simple algorithm can get you a very good result.
It is also possible to improve this algorithm further, for example by grouping identical deltas, or by compressing the 16-bit data that follow a magic number; for instance, after the magic number you could specify how many uncompressed 16-bit values follow.
But everything depends on your data.
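Here is a minimal C sketch of the 8-bit variant described above (function names are illustrative; the escape byte is followed by the full 16-bit value, and no dynamic allocation is used):

#include <stdint.h>
#include <stddef.h>

#define ESC ((int8_t)-128)  /* magic marker: a full 16-bit value follows */

/* Encode n 16-bit samples into out as deltas; out must hold at least
   3*n bytes for the worst case. Returns the number of bytes written.
   The first sample is always sent in full after the marker. */
size_t delta_encode(const int16_t *in, size_t n, uint8_t *out)
{
    size_t w = 0;
    int16_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        int32_t d = (int32_t)in[i] - prev;
        if (i > 0 && d >= -127 && d <= 127) {
            out[w++] = (uint8_t)(int8_t)d;              /* small delta: one byte */
        } else {
            out[w++] = (uint8_t)ESC;                    /* escape marker */
            out[w++] = (uint8_t)(in[i] & 0xFF);         /* full value, low byte */
            out[w++] = (uint8_t)((uint16_t)in[i] >> 8); /* full value, high byte */
        }
        prev = in[i];
    }
    return w;
}

/* Decode up to max samples from in; returns the number of samples decoded. */
size_t delta_decode(const uint8_t *in, size_t len, int16_t *out, size_t max)
{
    size_t r = 0, k = 0;
    int16_t prev = 0;
    while (r < len && k < max) {
        int8_t b = (int8_t)in[r++];
        if (b == ESC) {
            if (r + 2 > len) break;                     /* truncated input */
            prev = (int16_t)(in[r] | ((uint16_t)in[r + 1] << 8));
            r += 2;
        } else {
            prev = (int16_t)(prev + b);                 /* apply the delta */
        }
        out[k++] = prev;
    }
    return k;
}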

SQL - Queries with GROUP BY?

I want to use a query to get this result:
Postid(20) ---> 4 type(0) and 2 type(1)
Postid(21) ---> 3 type(0) and 3 type(1).
From this table:
id | userid | postid | type
1 | 465 | 20 | 0
2 | 465 | 21 | 1
3 | 466 | 20 | 1
4 | 466 | 21 | 0
5 | 467 | 20 | 0
6 | 467 | 21 | 0
7 | 468 | 20 | 1
8 | 468 | 21 | 1
9 | 469 | 20 | 0
10 | 469 | 21 | 1
11 | 470 | 20 | 0
12 | 470 | 21 | 0
I think I have to use GROUP BY; I tried it but got no results.
How can I achieve that result?
You need to use an aggregate function alongside the columns you want to group by in the SELECT part.
Note: any column that is selected alongside an aggregate function MUST appear in the GROUP BY clause.
The following code should answer your question:
SELECT COUNT(id), postid, type FROM table_name GROUP BY postid, type
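For the sample table above, that query would return the following (row order is not guaranteed without an ORDER BY), matching the result you describe:
COUNT(id) | postid | type
----------+--------+-----
        4 |     20 |    0
        2 |     20 |    1
        3 |     21 |    0
        3 |     21 |    1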
When using multiple GROUP BY columns, those entries that have all those columns in common will be grouped up, see here: https://stackoverflow.com/a/2421441/9743294

Multi page Report via JasperServer

I am working on a report that spans multiple pages. Every page has the same column names and logo; only the data differ.
Let me try to give an example.
Report: I want to generate all students' marks for the last 10 years. Suppose there are 60 students; we only want the marks of the top 10 rankers.
Page 1:
Student Name: XYZ
Year | Subject1 | Subject2 | Subject3 |
Sem-1 2016 | 99 | 98 | 99 |
Sem-2 2016 | 98 | 99 | 98 |
TOTAL 2016 | 197 | 197 | 197 |
Page 2:
Student Name: PQR
Year | Subject1 | Subject2 | Subject3 |
Sem-1 2016 | 97 | 98 | 97 |
Sem-2 2016 | 98 | 97 | 98 |
TOTAL 2016 | 195 | 195 | 195 |
And so on, up to the top X students, one student per page.
Is it possible to create such a report without merging separate reports?
Thanks a lot.
I got the answer. Thanks to Fabio and Petter for their help and for pointing me in the right direction.
I used a group band to separate the records of each student, so that each student's records are displayed on a new page.

PostgreSQL, complex query for calculating ingredients by recipe

I have to calculate the food and the ingredients of the used food stored in PostgreSQL tables:
table1 'usedfood'
food food used used
code name qty meas
----------------------------------------------
10 spaghetti 3 pcs
156 mayonnaise 2 pcs
173 ketchup 1 pcs
172 bolognese sauce 2 pcs
173 ketchup 1 pcs
10 spaghetti 2 pcs
156 mayonnaise 1 pcs
table2 'ingredients'
food ingr. ingredient qty meas
code code name /1 in 1
----------------------------------------------
10 1256 spaghetti rinf 75 gramm
156 1144 salt 0.3 gramm
10 1144 salt 0.5 gramm
156 1140 fresh egg 50 gramm
172 1138 tomato 80 gramm
156 1139 mustard 5 gramm
172 1136 clove 1 gramm
156 1258 oil 120 gramm
172 1135 laurel 0.4 gramm
10 1258 oil 0.4 gramm
172 1130 corned beef 40 gramm
result:
used
times code food/ingredient qty meas
----------------------------------------------
5 1256 spaghetti rinf 375 gramm
8 1258 oil 362 gramm
2 1138 tomato 160 gramm
3 1140 fresh egg 150 gramm
2 1130 corned beef 80 gramm
3 1139 mustard 15 gramm
8 1144 salt 3.4 gramm
2 1136 clove 2 gramm
2 1135 laurel 0.8 gramm
2 173 ketchup 2 pcs // has no ingredients
For now I do this by looping through table1 and querying table2 for each row, then adding up the results, and so on (in C), which may be very slow on larger data.
Table1 contains the food code, food name and used quantity.
Table2 contains the ingredients (in no particular order), with their code and the quantity used for one piece of food, plus the code of the food in which they appear.
The used quantity from table1 should be multiplied by the quantity from table2 according to each recipe, and the products summed up per ingredient code.
So all ingredient rows that belong to the food "spaghetti" carry spaghetti's food code (10).
Food without any ingredients should be counted with the quantity from table1 and shown under its own name. That effectively means it is a final product (like a beer bottle).
There may be further complications, but I'm afraid to ask. For example, the ingredient list may contain an ingredient that is itself a recipe, e.g. mustard, which consists of vinegar, salt, seed, etc. What then?
In table2 of the example shown, mustard is used as a ready-made product (component).
Is there any way to do such a calculation and get the results quickly using just PostgreSQL, so that it hands ready results to the C program?
Maybe it is not as complex as it seems to me? What would that query look like?
Try
SELECT SUM(f.qty) AS used_times,
       COALESCE(i.ingr_code, f.food_code) AS code,
       COALESCE(i.name, f.name) AS name,
       SUM(COALESCE(i.qty, 1) * f.qty) AS qty,
       COALESCE(i.meas, f.meas) AS meas
FROM usedfood f
LEFT JOIN ingredients i ON f.food_code = i.food_code
GROUP BY COALESCE(i.ingr_code, f.food_code),
         COALESCE(i.name, f.name),
         COALESCE(i.meas, f.meas)
Note: PostgreSQL requires every non-aggregated SELECT expression to appear in the GROUP BY clause, hence grouping by the COALESCE expressions; this also keeps distinct ingredient-less foods from being lumped into one group.
Output:
| USED_TIMES | CODE | NAME | QTY | MEAS |
----------------------------------------------------
| 2 | 173 | ketchup | 2 | pcs |
| 2 | 1130 | corned beef | 80 | gramm |
| 2 | 1135 | laurel | 0.8 | gramm |
| 2 | 1136 | clove | 2 | gramm |
| 2 | 1138 | tomato | 160 | gramm |
| 3 | 1139 | mustard | 15 | gramm |
| 3 | 1140 | fresh egg | 150 | gramm |
| 8 | 1144 | salt | 3.4 | gramm |
| 5 | 1256 | spaghetti rinf | 375 | gramm |
| 8 | 1258 | oil | 362 | gramm |
Here is an SQLFiddle demo.
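For reference, a minimal sketch of the schema the query above assumes (table and column names inferred from the query; the types are guesses), with a couple of rows from the example, enough to try it out:
CREATE TABLE usedfood (
    food_code integer,
    name      text,
    qty       numeric,
    meas      text
);

CREATE TABLE ingredients (
    food_code integer,   -- code of the food this ingredient belongs to
    ingr_code integer,
    name      text,
    qty       numeric,   -- quantity per one piece of food
    meas      text
);

INSERT INTO usedfood VALUES
    (10, 'spaghetti', 3, 'pcs'),
    (173, 'ketchup', 1, 'pcs');
INSERT INTO ingredients VALUES
    (10, 1256, 'spaghetti rinf', 75, 'gramm'),
    (10, 1144, 'salt', 0.5, 'gramm');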