print column and count the string - perl

I have a large, space-delimited, column-wise text file:
Name subject Result
John maths pass
John science fail
John history pass
John geography pass
Jack maths pass
jack history fail
kelly science pass
kelly history pass
For each name, I want to count how many subjects were passed (it is a long list of names, and each name should appear only once in the result). For example, John passed 3 and, similarly, Jack passed 1. It should print the result as
Name Passcount
John 3
Jack 1
Kelly 2
Can anybody help with an awk or perl script? Thanks in advance.

You can try something like this -
awk '
BEGIN{ print "Name\tPasscount"}
NR>1{if ($3=="pass") a[$1]++}
END{ for (x in a) print x"\t"a[x]}' file
Test:
$ cat file
Name subject Result
John maths pass
John science fail
John history pass
John geography pass
Jack maths pass
jack history fail
kelly science pass
kelly history pass
$ awk 'BEGIN{ print "Name\tPasscount"} NR>1{if ($3=="pass") a[$1]++}END{ for (x in a) print x"\t"a[x]}' file
Name Passcount
Jack 1
kelly 2
John 3
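For comparison, here is a minimal Python sketch of the same counting logic. The function name count_passes is made up for illustration, and it assumes whitespace-separated fields with a header on the first line. Like the awk version, it is case-sensitive (so "Jack" and "jack" are different names) and it omits names with no passes, but it keeps names in first-seen order, whereas the order of awk's for (x in a) is unspecified.

```python
from collections import Counter


def count_passes(lines):
    """Count rows whose third field is "pass", keyed by the first field."""
    counts = Counter()
    for index, line in enumerate(lines):
        if index == 0:
            continue  # skip the header row
        fields = line.split()
        if len(fields) < 3:
            continue  # ignore blank or malformed rows
        name, result = fields[0], fields[2]
        if result == "pass":
            counts[name] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Name subject Result",
        "John maths pass",
        "John science fail",
        "John history pass",
        "John geography pass",
        "Jack maths pass",
        "jack history fail",
        "kelly science pass",
        "kelly history pass",
    ]
    print("Name\tPasscount")
    for name, passed in count_passes(sample).items():
        print(f"{name}\t{passed}")
```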

Related

getting the "title" between the names

The name I'm working on is formatted like this:
King, Mr. Jay Thomas
Smith, Miss. Jane
How do I get the middle title part only using Postgres?
I'm a noob so this is definitely wrong:
SELECT position('%#," #"%#' for '#') AS TITLE
FROM titanic;
You could use SUBSTRING with the regex pattern \w+\.:
SELECT SUBSTRING(title from '\w+\.')
FROM titanic;
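To see what the pattern \w+\. picks out, here is a small Python sketch that applies the same regex to the two sample names. The helper name extract_title is hypothetical. It relies on the title being the first run of word characters followed by a period, which holds for these samples but is an assumption about the rest of the data.

```python
import re

# Same pattern as in the SQL above: word characters followed by a literal dot.
TITLE_RE = re.compile(r"\w+\.")


def extract_title(full_name):
    """Return the first 'word.' token in the name, or None if there is none."""
    match = TITLE_RE.search(full_name)
    return match.group(0) if match else None


if __name__ == "__main__":
    for name in ["King, Mr. Jay Thomas", "Smith, Miss. Jane"]:
        print(name, "->", extract_title(name))
```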

SAS hash table - multivalues for single key and find_next

I am using a hash table to look up values and determine whether there is a match. This is my test code.
What I am trying to do is check whether students in roster have entries in grade_roster; if a student has entries in grade_roster, I attach that grade_on_report to a new dataset that is a copy of roster.
name is my key and the grades are the values. One name may have several grades, i.e. multiple values for a single key. I was able to find all names in roster that have a match in grade_roster using find_next(), but I could not attach the right grade to the new dataset.
It seems that whenever find_next() is called, the value for the key is set to the next item in the list, so the last value ends up assigned to every row with that key.
Here is my code:
data roster;
input name $ course $ grade_on_paper $;
datalines;
Mary English A
Mary German B
Josh English B
Lily Spanish B
Lucy Physics C
John Music A
Eric Math A
Eric Music B
;
run;
data grade_roster;
input name $ course $ grade_on_report $;
datalines;
Mary English A
Mary German B
John Music A
Eric Math A
Eric Music B
;
run;
data assign_grade;
set roster;
format grade_on_report $1.;
declare hash ht1(dataset:"grade_roster", multidata:"Y");
ht1.defineKey("name");
ht1.defineData("grade_on_report");
ht1.defineDone();
rc = ht1.find();
do while(rc = 0);
rc = ht1.find_next();
end;
run;
What I got was this:
name course grade_on_paper grade_on_report name_found
1 Mary English A B Y
2 Mary German B B Y
3 Josh English B
4 Lily Spanish B
5 Lucy Physics C
6 John Music A A Y
7 Eric Math A B Y
8 Eric Music B B Y
What I want is:
name course grade_on_paper grade_on_report name_found
1 Mary English A A Y
2 Mary German B B Y
3 Josh English B
4 Lily Spanish B
5 Lucy Physics C
6 John Music A A Y
7 Eric Math A A Y
8 Eric Music B B Y
Note: name and course together are not unique identifiers. It seems like they are unique identifiers in this particular test code but they are not unique identifiers in the actual dataset I am working on. The goal is to use name as the only key in ht1.defineKey() and get the correct result.
Any help would be appreciated. Thank you!
If you are trying to find data on a specific student and course, you need to make both name and course part of the key.
data assign_grade;
set roster;
format grade_on_report $1.;
declare hash ht1(dataset:"grade_roster", multidata:"Y");
ht1.defineKey("name", "course");
ht1.defineData("grade_on_report");
ht1.defineDone();
rc = ht1.find();
do while(rc = 0);
rc = ht1.find_next();
end;
run;
I tested this on slightly different data
data roster;
input name $ course $ grade_on_paper $;
datalines;
Mary English A
Mary German B
Josh English B
Lily Spanish B
Lucy Physics C
John Music A
Eric Math A
Eric Music B
;
run;
data grade_roster;
input name $ course $ grade_on_report $;
datalines;
Mary German B
Mary English A
John Music A
Eric Math C
Eric Music B
;
run;
and got what I expected.
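The difference between the two keys can be illustrated outside SAS. The Python sketch below is only an analogy, not a model of the SAS hash object: lookup_by_name mimics the behavior reported in the question (walking every value stored under a name leaves the last one in place), while lookup_by_name_and_course uses a composite key as suggested above. Both function names are made up for illustration.

```python
from collections import defaultdict


def lookup_by_name(roster, grade_roster):
    """Key on name only and walk every stored value, keeping the last one."""
    by_name = defaultdict(list)
    for name, _course, grade in grade_roster:
        by_name[name].append(grade)
    result = []
    for name, course in roster:
        grade = ""
        for candidate in by_name.get(name, []):
            grade = candidate  # each step overwrites the previous value
        result.append((name, course, grade))
    return result


def lookup_by_name_and_course(roster, grade_roster):
    """Key on (name, course) so each roster row finds its own grade."""
    by_key = {(name, course): grade for name, course, grade in grade_roster}
    return [(name, course, by_key.get((name, course), ""))
            for name, course in roster]


if __name__ == "__main__":
    grade_roster = [("Mary", "English", "A"), ("Mary", "German", "B"),
                    ("John", "Music", "A"), ("Eric", "Math", "A"),
                    ("Eric", "Music", "B")]
    roster = [("Mary", "English"), ("Mary", "German"),
              ("Josh", "English"), ("Eric", "Math")]
    print(lookup_by_name(roster, grade_roster))
    print(lookup_by_name_and_course(roster, grade_roster))
```

As the question notes, this only helps when name and course together identify a row; if they do not, some other distinguishing column would have to join the key.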

I have a Tab separated file in Unix which has data issue

I have to make sure each line has 4 columns, but the input data is quite a mess:
The first line is the header.
The second line is valid, as it has 4 columns.
The third is also valid (it's OK if the Description field is null).
The ID field and the last column, Phnumber, are never null.
As one can see, the 4th line is messed up: a newline character in the Description column makes it span multiple lines.
ID Name Description Phnumber
1051 John 5674 I am doing good, is this task we need to fix 908342
1065 Rohit 9876246
10402 rob I am
doing good,
is this task we need to fix 908341
105552 "Julin rob hain" i know what to do just let me do it
"
"
"
"
"
"
908452 1051 Dave I am doing reporting this week 88889999
Maybe a screenshot will make it easier to see the problem
Each line starts with a number and ends with a number. Each line should have 4 columns.
Desired output:
ID Name Description Phnumber
1051 John 5674 I am doing good, is this task we need to fix 908342
1065 Rohit 9876246
10402 rob I am doing good, 563 is this task we need to fix 908341
105552 "Julin rob hain" i know what to do just let me do it 908452
1051 Dave I am doing reporting this week 88889999
This is sample data; the actual file has 12 columns. Yes, the columns in between can contain numbers, and a few are date fields (like 2017-03-02).
This did the trick
cat file_name | perl -0pe 's/\n(?!([0-9]{6}|$)\t)//g' | perl -0pe 's/\r(?!([0-9]{6}|$)\t)//g' | sed '/^$/d'
awk to the rescue!
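This assumes all-digit fields appear only as the first and last fields (the regex is anchored so that fields merely containing a digit, such as dates, are not treated as record boundaries).
awk 'NR==1;
NR>1 {for(i=1;i<=NF;i++)
{if($i~/^[0-9]+$/) s=!s; printf "%s", $i (s?OFS:RS)}}' file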
ID Name Description Phnumber
1051 John I am doing good, is this task we need to fix 908342
10423 rob I am doing good, is this task we need to fix 908341
1052 Julin rob hain i know what to do just let me do it " " " " " " 908452
1051 Dave I am doing reporting this week 88889999
Perhaps set OFS to \t to give the output more structure.
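Another way to approach this is to rely on the column count rather than on which fields look numeric. The Python sketch below is only an illustration under stated assumptions: the file really is tab-separated, stray newlines occur only inside a field before the last column, and the function name repair_rows is hypothetical. It keeps appending physical lines to a buffer until the buffer holds the expected number of tab-separated fields.

```python
def repair_rows(lines, n_cols):
    """Merge physical lines until each record has n_cols tab-separated fields."""
    rows = []
    buffer = ""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # drop blank lines left behind by stray newlines
        # A continuation line is glued on with a space in place of the newline.
        buffer = line if not buffer else buffer + " " + line
        if buffer.count("\t") >= n_cols - 1:
            rows.append(buffer.split("\t"))
            buffer = ""
    if buffer:
        rows.append(buffer.split("\t"))  # incomplete trailing record, kept as-is
    return rows


if __name__ == "__main__":
    raw = [
        "ID\tName\tDescription\tPhnumber\n",
        "1065\tRohit\t\t9876246\n",
        "10402\trob\tI am\n",
        "doing good,\n",
        "is this task we need to fix\t908341\n",
    ]
    for row in repair_rows(raw, 4):
        print("\t".join(row))
```

For the real file, n_cols would be 12; embedded numbers and dates in the middle columns do not matter because only tabs are counted.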

Tcl scripting using grep,sed

I have to find a particular string first, then I need to pick up a value in that string, which changes after every run. How can I do this?
Stage pro begin_time { size is 54mb and 3 sec }
Stage pro1 begin_time {.....}
I have a very big file. I have to look specifically for the begin_time string first, then look for its value and store it somewhere. How can I do this?
I have tried using grep, but I don't know how to store the value 3 here. I have many similar cases; how can I handle this situation?
Thanks for the reply. I want to store the value 3 against pro, then some other value against pro1. Second, the time mentioned here is a start time. I have the same information for each stage (pro, pro1) but with an end time, which I have to extract as well. Then I have to calculate the net time elapsed, i.e. the time between the start time and the end time.
cat aaa
Stage pro begin_time { size is 54mb and 3 sec }
Stage pro1 begin_time {.....}
cat aaa |awk '$3=="begin_time" {print $9}' |grep .
3
If you wish to store it:
var=$(cat aaa |awk '$3=="begin_time" {print $9}' |grep .)
echo $var
3
If you wish to map pro to 3, then (checking NF so that lines without a ninth field, such as the pro1 line, are skipped):
cat aaa | awk '$3=="begin_time" && NF>=9 {print $2,$9}'
pro 3
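For the follow-up about elapsed time, here is a rough Python sketch. The question does not show what the end-time lines look like, so this assumes they mirror the begin_time lines with the keyword end_time ("Stage pro end_time { ... 10 sec }"); that format, the regex, and the function name elapsed_by_stage are all assumptions made for illustration.

```python
import re

# Captures the stage name, the keyword, and the number just before "sec".
STAGE_RE = re.compile(
    r"^Stage\s+(\S+)\s+(begin_time|end_time)\s+\{.*?(\d+)\s+sec"
)


def elapsed_by_stage(lines):
    """Return {stage: end - begin} for stages that have both times."""
    times = {}
    for line in lines:
        match = STAGE_RE.match(line)
        if match:
            stage, kind, seconds = match.groups()
            times.setdefault(stage, {})[kind] = int(seconds)
    return {
        stage: t["end_time"] - t["begin_time"]
        for stage, t in times.items()
        if "begin_time" in t and "end_time" in t
    }


if __name__ == "__main__":
    log = [
        "Stage pro begin_time { size is 54mb and 3 sec }",
        "Stage pro1 begin_time {.....}",
        "Stage pro end_time { size is 54mb and 10 sec }",  # assumed format
    ]
    print(elapsed_by_stage(log))
```

Lines whose braces contain no "N sec" value, like the elided pro1 line, are simply skipped.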

BASH: comm (or similar) when compare multiple files

I have the following problem: I would like to compare the contents of 8 files, each containing a list like this:
Sample1.txt Sample2.txt Sample3.txt
apple pineapple apple
pineapple apple pineapple
bananas bananas bananas
orange orange mango
grape nuts nuts
Using comm Sample1.txt Sample2.txt I get something like this:
grape nuts apple
pineapple
bananas
orange
meaning that in the first column I have something related only to the first sample, the second column the things related only to the second sample and the third column the things in common.
I would like to do the same but with 8 files (samples). With diff it is not possible, but in the end I would like to have
Sample1 Sample2 Sample3 ...Sample8 Things in common
grape nuts mango apple
pineapple
bananas
Is there a way to do it with bash? Is there a command like diff that can search for differences across more than two files?
Thank you to everybody...I know this is a challenging question
Fabio
Here is my naive solution (note that comm expects sorted input, and that this only yields the lines common to all files):
first=sample1.txt
for a in *.txt; do
  comm -12 "$first" "$a" > "temp_$a"
  echo "comparing $first and $a, writing to temp_$a"
  first="temp_$a"; cat "temp_$a"
done
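If stepping outside bash is acceptable, set operations handle any number of files and need no sorting. The Python sketch below is one possible reading of the request, not the only one: "only in SampleN" is taken to mean items that appear in that file and in no other, and the function name compare_lists is hypothetical.

```python
def compare_lists(samples):
    """Return (items common to all samples, items unique to each sample)."""
    sets = {name: set(items) for name, items in samples.items()}
    common = set.intersection(*sets.values())
    unique = {}
    for name, items in sets.items():
        # Everything that appears in any of the other samples.
        others = set().union(*(s for n, s in sets.items() if n != name))
        unique[name] = items - others
    return common, unique


if __name__ == "__main__":
    samples = {
        "Sample1": ["apple", "pineapple", "bananas", "orange", "grape"],
        "Sample2": ["pineapple", "apple", "bananas", "orange", "nuts"],
        "Sample3": ["apple", "pineapple", "bananas", "mango", "nuts"],
    }
    common, unique = compare_lists(samples)
    print("common:", sorted(common))
    for name, items in unique.items():
        print(name, "only:", sorted(items))
```

With real files, each list would come from reading a file's lines with the trailing newlines stripped. Under this definition "nuts" is not unique to Sample2, because it also appears in Sample3.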