How to group by columns? - group-by

I am having trouble figuring out how to group rows by a column. My goal is to count, for each country, the number of distinct 'Package Code' values whose Color is Orange or Blue.
I am working with thousands of rows of data. This is a subset of the data:
Country Package Code Color Type
US 100 Orange a
US 100 Orange b
US 100 Orange c
Mexico 200 Green d
US 300 Blue e
Canada 400 Red f
Germany 500 Red g
Germany 600 Blue h
Desired Output:
Country Packages
US 2
Mexico 0
Canada 0
Germany 1

Using isin + nunique + reindex
(df.loc[df.Color.isin(['Orange', 'Blue'])].groupby('Country')['Package Code']
.nunique().reindex(df.Country.unique(), fill_value=0)).to_frame('Total').reset_index()
Country Total
0 US 2
1 Mexico 0
2 Canada 0
3 Germany 1
Here is the above command broken down a bit for better readability:
# Select rows where the color is Orange or Blue
u = df.loc[df.Color.isin(['Orange', 'Blue'])]
# Find the unique values for Package Code, grouped by Country
w = u.groupby('Country')['Package Code'].nunique()
# Add in missing countries with a value of 0
w.reindex(df.Country.unique(), fill_value=0).to_frame('Total').reset_index()
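For completeness, here is a self-contained sketch that rebuilds the sample data and runs the same steps end to end (the DataFrame construction is assumed from the table in the question):
import pandas as pd

# Reconstruct the sample data from the question
df = pd.DataFrame({
    'Country': ['US', 'US', 'US', 'Mexico', 'US', 'Canada', 'Germany', 'Germany'],
    'Package Code': [100, 100, 100, 200, 300, 400, 500, 600],
    'Color': ['Orange', 'Orange', 'Orange', 'Green', 'Blue', 'Red', 'Red', 'Blue'],
    'Type': list('abcdefgh'),
})

# Select rows where the color is Orange or Blue
u = df.loc[df.Color.isin(['Orange', 'Blue'])]
# Count distinct Package Codes per Country
w = u.groupby('Country')['Package Code'].nunique()
# Add missing countries back with a count of 0, keeping the original country order
print(w.reindex(df.Country.unique(), fill_value=0).to_frame('Total').reset_index())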

Related

How to filter out bad values in a data set regarding a matrix in matlab?

I wanted to ask how to "filter out" bad values from a very large data matrix in MATLAB.
e.g.: I have a MATLAB data file containing a 2*5000 matrix of doubles which represents x and y coordinates. How is it possible to delete all values above or below a certain limit (i.e. drop the whole column that contains such a value)?
or easier:
(matrix from data file)
1 2 4 134 2
3 5 5 4 2
or
1 2 4 9 2
3 5 5 234 2
after setting a certain limit and deleting the offending column:
1 2 4 2
3 5 5 2
Find the "bad" elements, e.g. A < 0 | A > 20
Find the "good" columns, e.g. ~max(A < 0 | A > 20)
Keep the "good" columns / Remove the "bad" columns, e.g. A(:, ~max(A < 0 | A > 20))
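For illustration, here is the same column-filtering logic sketched in Python/NumPy (not part of the original MATLAB answer; the limits 0 and 20 are taken from the example above):
import numpy as np

A = np.array([[1, 2, 4, 134, 2],
              [3, 5, 5, 4, 2]])

# Mark the "bad" elements, e.g. anything outside the range [0, 20]
bad = (A < 0) | (A > 20)
# A column is "good" if none of its elements are bad
good_cols = ~bad.any(axis=0)
# Keep only the good columns
print(A[:, good_cols])
# [[1 2 4 2]
#  [3 5 5 2]]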

Merge macro tables in SAS

I'm a beginner in SAS so I am unfamiliar with syntax. I have two datasets that were created using macros.
(macro: https://gist.github.com/statgeek/2f27939fd72d1dd7d8c8669cd39d7e67)
DATA test1;
set sashelp.class;
if prxmatch('m/M/oi', sex);
female=ifn( sex='F',1,0);
RUN;
%table_char(test1, height weight age, sex, female, test1_table_char);
DATA test2;
set sashelp.class;
if prxmatch('m/F/oi', sex);
female=ifn( sex='F',1,0);
RUN;
%table_char(test2, height weight age, sex, female, test2_table_char);
Desired Output:
Male Female
Height
Count
Mean
Median
.
.
Weight
Count
Mean
Median
.
.
Sex
M
F
Etc
I would like to merge the two tables created with %table_char by Name. How should I reference the two tables so I can merge them?
DATA final_merge;
merge test1_table_char test2_table_char;
by NAME;
RUN;
It looks like what you need to do is append the datasets:
data final;
set test1 test2;
run;
You need not split and merge the datasets; you can simply do:
DATA final;
set sashelp.class;
female=ifn( sex='F',1,0);
RUN;
If you do want to merge, sort the datasets first and then merge them:
proc sort data =test1;
by your_variable;
run;
proc sort data =test2;
by your_variable;
run;
data final;
merge test1 test2;
by your_variable;
run;
Combine or merge test1 and test2 by NAME:
proc sort data=work.test1;
by name;
run;
proc sort data=work.test2;
by name;
run;
data work.test;
merge work.test1 work.test2;
by name;
run;
Producing this result:
Name Sex Age Height Weight female
Alfred M 14 69.0 112.5 0
Alice F 13 56.5 84.0 1
Barbara F 13 65.3 98.0 1
Carol F 14 62.8 102.5 1
Henry M 14 63.5 102.5 0
James M 12 57.3 83.0 0
Jane F 12 59.8 84.5 1
Janet F 15 62.5 112.5 1
Jeffrey M 13 62.5 84.0 0
John M 12 59.0 99.5 0
Joyce F 11 51.3 50.5 1
Judy F 14 64.3 90.0 1
Louise F 12 56.3 77.0 1
Mary F 15 66.5 112.0 1
Philip M 16 72.0 150.0 0
Robert M 12 64.8 128.0 0
Ronald M 15 67.0 133.0 0
Thomas M 11 57.5 85.0 0
William M 15 66.5 112.0 0
Run the macro on the merged dataset:
%table_char(test, height weight age, sex, female, test_table_char);
Produces the following result:
categorical value
Sex
F 9(47.4%)
M 10(52.6%)
Height
Count(Missing) 19(0)
Mean (SD) 62.3(5.1)
Median (IQR) 62.8(57.5 - 66.5
Range 51.3 - 72.0
90th Percentile 69.0
Weight
Count(Missing) 19(0)
Mean (SD) 100.0(22.8)
Median (IQR) 99.5(84.0 - 112.
Range 50.5 - 150.0
90th Percentile 133.0
Age
Count(Missing) 19(0)
Mean (SD) 13.3(1.5)
Median (IQR) 13.0(12.0 - 15.0
Range 11.0 - 16.0
90th Percentile 15.0
Female 9(47.4%)

Delete redundant entries within groups

I want to remove redundant rows within each group (in this case datasource) in my database. I define a redundant row as one that contains strictly less information than some other row in its group, i.e. the same values but with more of them missing.
For example, in the table below, row 1 is redundant because row 0 in the same group contains exactly the same information but with more data.
For the same reason row 6 is redundant: the other rows 3, 4 and 5 in its group all contain more information than it does. However, I keep both rows 4 and 5, because each carries some additional, different information that the other rows in the group lack.
datasource city country
0 1 Shallotte US
1 1 None US
2 2 austin US
3 3 Casselberry US
4 3 None AU
5 3 Springfield None
6 3 None None
Here is an example with more columns: rows 0, 1 and 4 each contain different information, while rows 2 and 3 are redundant (row 3 duplicates row 1, and row 2 is contained in row 0).
datasource city country Count
0 1 None US 11
1 1 austin None None
2 1 None None 11
3 1 austin None None
4 1 None CA None
Expected output
datasource city country Count
0 1 None US 11
1 1 austin None None
4 1 None CA None
Is there a simple way to achieve such logic in pandas or SQL (PostgreSQL) for any number of columns?
Here's a different approach using the same basic strategy as Bharath shetty's solution. This way feels a bit neater to me.
First, construct the example data frame:
import pandas as pd
data = {"datasource": [1,1,2,3,3,3,3],
"city": ["Shallotte", None, "austin", "Casselberry", None, "Springfield", None],
"country": ["US", "US", "US", "US", "AU", None, None]}
df = pd.DataFrame(data)
df['null'] = df.isnull().sum(axis=1)
print(df)
city country datasource null
0 Shallotte US 1 0
1 None US 1 1
2 austin US 2 0
3 Casselberry US 3 0
4 None AU 3 1
5 Springfield None 3 1
6 None None 3 2
Now make a boolean mask using groupby and apply - we just drop the biggest null values per group:
def null_filter(d):
    if len(d) > 1:
        return d.null < d.null.max()
    return d.null == d.null

mask = df.groupby("datasource").apply(null_filter).values
df.loc[mask].drop(columns="null")
Output:
city country datasource
0 Shallotte US 1
2 austin US 2
3 Casselberry US 3
4 None AU 3
5 Springfield None 3
One way is based on the missing-value count: within each group, remove the rows that have the maximum number of missing values, i.e.
# Count the missing values across each row
df['Null'] = df.isnull().sum(axis=1)
# Get the maximum of that count within each group
df['Max'] = df.groupby('datasource')['Null'].transform('max')
# Drop rows whose count equals a non-zero group maximum, then drop the helper columns
df = df[~((df['Max'] != 0) & (df['Max'] == df['Null']))].drop(['Null','Max'], axis=1)
Output :
datasource city country
0 1 Shallotte US
2 2 austin US
3 3 Casselberry US
4 3 None AU
5 3 Springfield None
Hope it helps
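Note that the null-count heuristic above reproduces the first example but not the multi-column one, where rows can have the same number of missing values yet still be contained in another row. Below is a minimal sketch of the stricter "dominated row" check that the question describes; the helper name drop_redundant is hypothetical and missing values are assumed to be Python None, as in the examples:
import pandas as pd

def drop_redundant(df, group_col="datasource"):
    # Drop rows whose non-null values are all matched by another row in the
    # same group that carries at least as much information; exact duplicates
    # keep their first occurrence.
    cols = [c for c in df.columns if c != group_col]
    keep = []
    for _, g in df.groupby(group_col):
        records = g[cols].to_dict("records")
        for i, (idx, row) in enumerate(zip(g.index, records)):
            info = sum(pd.notna(v) for v in row.values())
            dominated = False
            for j, other in enumerate(records):
                if j == i:
                    continue
                covers = all(pd.isna(v) or v == other[c] for c, v in row.items())
                other_info = sum(pd.notna(v) for v in other.values())
                if covers and (other_info > info or (other == row and j < i)):
                    dominated = True
                    break
            if not dominated:
                keep.append(idx)
    return df.loc[keep]
On the two examples above this keeps rows 0, 2, 3, 4, 5 and rows 0, 1, 4 respectively, matching the expected outputs.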

TF-IDF and Rocchio classification in Introduction to Information Retrieval

I'm looking at Table 14.1 from Vector Space Classification (chapter at link) in Introduction to Information Retrieval. Example 14.1 says it "shows the tf-idf vector representations of the five documents in Table 13.1 using the formula (1 + log tf) * log(4/df) if tf > 0." Yet when I look at Table 14.1, it does not appear that this TF-IDF formula was applied to the document vectors.
The documents from table 13.1 are:
1: Chinese Beijing Chinese
2: Chinese Chinese Shanghai
3: Chinese Macao
4: Tokyo Japan Chinese
and the term weights for the vectors in Table 14.1 are:
vector Chinese Japan Tokyo Macao Beijing Shanghai
d1 0 0 0 0 1.0 0
d2 0 0 0 0 0 1.0
d3 0 0 0 1.0 0 0
d4 0 0.71 0.71 0 0 0
If I apply the TF-IDF formula to the Japan dimension of d4, I get:
TF: 1 (the term appears once in document 4)
IDF: log(4 / 1) (the term is present only in document 4)
The TF-IDF weight is thus: (1 + log 1) * log(4/1) = log(4) ≈ 0.60
Why does my calculated value differ from what is included in the text?
You have computed tf-idf correctly. The text is a bit misleading when it says
Table 14.1 shows the tf-idf vector representations of the five documents
in Table 13.1.
It is actually showing the tf-idf vector representations normalized to unit length.
The details:
Document 4 has three words "Tokyo", "Japan" and "Chinese".
You correctly computed that the TF-IDF weights for both "Tokyo" and "Japan"
should be
log10(4) ≈ 0.60. "Chinese" is in all documents, so the IDF part
of its weight is log(4/4) = 0 and the weight for "Chinese" is zero.
So the vector for document 4 is
Chinese Japan Tokyo Macao Beijing Shanghai
0 0.60 0.60 0 0 0
But the length of this vector is sqrt(0.60^2 + 0.60^2) ≈ 0.85. To get a vector of unit length, all components are divided by 0.85, giving the vector in the text:
Chinese Japan Tokyo Macao Beijing Shanghai
0 0.71 0.71 0 0 0
It may be worth noting that the reason that we use vectors of unit length is to adjust for documents of different lengths. Without this adjustment, long documents would generally match queries better than short documents.
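As a quick numeric check of the normalization step (a small sketch using base-10 logs, as in the text):
import math

# tf-idf weights for document 4, using (1 + log10 tf) * log10(N / df) with N = 4
w_japan = (1 + math.log10(1)) * math.log10(4 / 1)    # ≈ 0.602
w_tokyo = (1 + math.log10(1)) * math.log10(4 / 1)    # ≈ 0.602
w_chinese = (1 + math.log10(1)) * math.log10(4 / 4)  # = 0.0

# Normalize to unit length
length = math.sqrt(w_japan**2 + w_tokyo**2 + w_chinese**2)  # ≈ 0.851
print(round(w_japan / length, 2), round(w_tokyo / length, 2))  # 0.71 0.71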

Count group of ones in series

I have a series a=[100 200 1 1 1 243 300 1 1 1 1 1 400 1 900 600 900 1 1 1 ]
I have to count how many times 1 occurs when it appears in a group (a run of consecutive ones).
The first group of 1's has count 3 (it lies between 200 and 243).
The second group of ones, lying between 300 and 400, has count 5. The counts of ones for all groups are [3 5 1 3].
Please give me some suggestions.
Use diff on a==1. Bracket with false to ensure the count is correct no matter what the starting or ending values of a are. Finally, find the start and end of each run and subtract:
d = diff([false, a==1, false]);
result = find(d==-1) - find(d==1);
In your example this gives
result =
3 5 1 3
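For reference, here is the same padded-diff idea translated to Python/NumPy (not part of the original MATLAB answer):
import numpy as np

a = np.array([100, 200, 1, 1, 1, 243, 300, 1, 1, 1, 1, 1, 400, 1, 900, 600, 900, 1, 1, 1])

# Pad the boolean mask with False on both ends so runs touching the edges are counted
d = np.diff(np.concatenate(([False], a == 1, [False])).astype(int))
# +1 marks the start of a run of ones, -1 marks one past its end
run_lengths = np.flatnonzero(d == -1) - np.flatnonzero(d == 1)
print(run_lengths)  # [3 5 1 3]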