I have a tab-separated file in Unix which has a data issue - perl

I have to make sure each line has 4 columns, but the input data is quite a mess:
The first line is the header.
The second line is valid, as it has 4 columns.
The third is also valid (it's OK if the Description field is null).
The ID field and the last column, Phnumber, are never null.
As you can see, the 4th record is broken: a newline character in the Description column makes it span multiple lines.
ID Name Description Phnumber
1051 John 5674 I am doing good, is this task we need to fix 908342
1065 Rohit 9876246
10402 rob I am
doing good,
is this task we need to fix 908341
105552 "Julin rob hain" i know what to do just let me do it
"
"
"
"
"
"
908452
1051 Dave I am doing reporting this week 88889999
Each record starts with a number and ends with a number. Each record should have 4 columns.
Desired output:
ID Name Description Phnumber
1051 John 5674 I am doing good, is this task we need to fix 908342
1065 Rohit 9876246
10402 rob I am doing good, 563 is this task we need to fix 908341
105552 "Julin rob hain" i know what to do just let me do it 908452
1051 Dave I am doing reporting this week 88889999
This is sample data; the actual file has 12 columns. Yes, the columns in between can contain numbers, and a few are date fields (like 2017-03-02).

This did the trick (note that the pattern expects every ID to be exactly 6 digits):
cat file_name | perl -0pe 's/\n(?!([0-9]{6}|$)\t)//g' | perl -0pe 's/\r(?!([0-9]{6}|$)\t)//g' | sed '/^$/d'
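For comparison, here is a minimal Python sketch of the same idea: a physical line starts a new record only if it begins with an all-digit ID followed by a tab, and anything else is glued onto the previous record. The function name `rejoin_records` and the sample rows are made up for illustration, and it assumes a continuation line never happens to start with digits plus a tab.

```python
import re

# Assumption: a new record starts with an all-digit ID followed by a tab.
RECORD_START = re.compile(r"^\d+\t")

def rejoin_records(lines):
    """Merge physical lines that belong to one logical record."""
    records = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # drop blank lines, like the sed '/^$/d' step
        if not records or RECORD_START.match(line):
            records.append(line)  # header or start of a new record
        else:
            prev = records[-1]
            # Add a space only if neither side already supplies whitespace.
            glue = "" if prev[-1:].isspace() or line[:1].isspace() else " "
            records[-1] = prev + glue + line
    return records

sample = [
    "ID\tName\tDescription\tPhnumber\n",
    "1051\tJohn\tI am doing good\t908342\n",
    "10402\trob\tI am\n",
    "doing good,\n",
    "is this task we need to fix\t908341\n",
]
for record in rejoin_records(sample):
    print(record)
```

Unlike the regex above, this does not depend on the ID having a fixed number of digits.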

awk to the rescue!
This assumes that all-digit fields appear only in the first and last columns.
awk 'NR==1;
NR>1 {for(i=1;i<=NF;i++)
{if($i~/^[0-9]+$/) s=!s; printf "%s", $i (s?OFS:RS)}}' file
ID Name Description Phnumber
1051 John I am doing good, is this task we need to fix 908342
10423 rob I am doing good, is this task we need to fix 908341
1052 Julin rob hain i know what to do just let me do it " " " " " " 908452
1051 Dave I am doing reporting this week 88889999
Perhaps set OFS to \t so the output keeps its tab-separated structure.

Related

Regexmatch PCRE parse last name

I'm trying to parse the first and last name into their own variables. I figured out how to get the first name, but I can't get the last name. Any help would be appreciated. I'm reading the data off the clipboard.
Clipboard = John Smith
RegExMatch(clipboard, "(^([\w\-]+))", FirstName)
RegExMatch(clipboard, " ", LastName)
For this very specific example, you could probably use $ at the end instead of ^ at the beginning, as in "(([\w\-]+)$)".
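The same anchoring idea, sketched in Python rather than AutoHotkey (so this only illustrates the pattern, not the RegExMatch call itself): `^` pins the match to the start of the string for the first name, `$` pins it to the end for the last name. The variable names are illustrative.

```python
import re

clipboard = "John Smith"  # sample value from the question

first = re.search(r"^[\w\-]+", clipboard)  # word anchored at the start
last = re.search(r"[\w\-]+$", clipboard)   # word anchored at the end

first_name = first.group(0) if first else ""
last_name = last.group(0) if last else ""
print(first_name, last_name)
```

With a middle name present, this still picks the first and the last word only.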

use perl to extract specific output lines

I'm endeavoring to create a system to generalize rules from input text. I'm using reVerb to create my initial set of rules. Using the following command[*], for instance:
$ echo "Bananas are an excellent source of potassium." | ./reverb -q | tr '\t' '\n' | cat -n
To generate output of the form:
1 stdin
2 1
3 Bananas
4 are an excellent source of
5 potassium
6 0
7 1
8 1
9 6
10 6
11 7
12 0.9999999997341693
13 Bananas are an excellent source of potassium .
14 NNS VBP DT JJ NN IN NN .
15 B-NP B-VP B-NP I-NP I-NP I-NP I-NP O
16 bananas
17 be source of
18 potassium
I'm currently piping the output to a file, which includes the preceding white space and numbers as depicted above.
What I'm really after is just the simple rule at the end, i.e. lines 16, 17 & 18. I've been trying to create a script to extract just that component and put it to a new file in the form of a Prolog clause, i.e. be source of(bananas, potassium).
Is that feasible? Can Prolog rules contain white space like that?
I think I'm locked into getting all that output from reVerb so, what would be the best way to extract the desirable component? With a Perl script? Or maybe sed?
*Later I plan to replace this with a larger input file as opposed to just single sentences.
This seems wasteful. Why not leave the tabs as they are, and use:
$ echo "Bananas are an excellent source of potassium." \
| ./reverb -q | cut --fields=16,17,18
And yes, you can have rules like this in Prolog. See the answer by @mat. You need to know a bit of Prolog before you move on, I guess.
It is easier, however, to just make the string a valid name for a predicate:
be_source_of with underscores instead of spaces
or 'be source of' with spaces, and enclosed in single quotes.
You can probably use awk to do what you want with the three fields. See for example the printf command in awk. Or, you can parse it again from Prolog directly. Both are beyond the scope of your current question, I feel.
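As a rough illustration of the "do something with the three fields" step, here is a Python sketch that takes one tab-separated ReVerb output line and formats fields 16 to 18 as a Prolog fact with quoted atoms. The helper names `quote_atom` and `to_prolog_fact` are hypothetical, and the field positions are taken from the numbered output in the question, not from ReVerb's documentation.

```python
def quote_atom(text):
    """Wrap text in single quotes so it is usable as a Prolog atom."""
    escaped = text.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"

def to_prolog_fact(reverb_line):
    """Build relation(arg1, arg2). from fields 16, 17 and 18 (1-based)."""
    fields = reverb_line.rstrip("\n").split("\t")
    arg1, relation, arg2 = fields[15:18]
    return f"{quote_atom(relation)}({quote_atom(arg1)}, {quote_atom(arg2)})."

# The 18 fields shown in the question, joined back into one tab-separated line.
sample_line = "\t".join([
    "stdin", "1", "Bananas", "are an excellent source of", "potassium",
    "0", "1", "1", "6", "6", "7", "0.9999999997341693",
    "Bananas are an excellent source of potassium .",
    "NNS VBP DT JJ NN IN NN .",
    "B-NP B-VP B-NP I-NP I-NP I-NP I-NP O",
    "bananas", "be source of", "potassium",
])
fact = to_prolog_fact(sample_line)
print(fact)
```

Quoting every atom is a conservative choice: it keeps the relation's spaces intact without needing any operator declarations.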
sed -n '1{N;N
}
$!{N
D
}
s/\(.*\)\n\(.*\)\n\(.*\)/\2 (\1,\3)/p' YourFile
If the line numbers are actually in the output and not just there for reference, replace the last sed action with
s/^[[:space:]]*[0-9]\{1,\}[[:space:]]\{1,\}\(.*\)\n[[:space:]]*[0-9]\{1,\}[[:space:]]\{1,\}\(.*\)\n[[:space:]]*[0-9]\{1,\}[[:space:]]\{1,\}\(.*\)/\2 (\1,\3)/p
This assumes the last 3 lines are the source of your "rules".
Regarding the Prolog part of the question:
Yes, Prolog facts can contain whitespace like this, with suitable operator declarations present.
For example:
:- op(700, fx, be).
:- op(650, fx, source).
:- op(600, fx, of).
Example query and its result, to let you see the shape of terms that are created with this syntax:
?- write_canonical(be source of(a, b)).
be(source(of(a,b))).
Therefore, with these operator declarations, a fact like:
be source of(a, b).
is exactly the same as stating:
be(source(of(a,b))).
Depending on use cases and other definitions, it may even be an advantage to create this kind of facts (i.e., facts of the form be/1 instead of source_of/2). If this is the only kind of facts you need, you can simply write:
source_of(a, b).
This creates no redundant wrappers and is easier to use.
Or, as Boris suggested, you can use single quotes as in 'be source of'/2.

emulate SAS' datastep statement FIRST using linux command line tools

Let's say I have the first column of the following dataset in a file and I want to emulate the flag in the second column so I export only that row tied to a flag = 1 (dataset is pre-sorted by the target column):
1 1
1 0
1 0
2 1
2 0
2 0
I could run awk '!seen[$1]++ {print}' dataset but would run into a problem for very large files (seen keeps growing). Is there an alternative to handle this without tracking every single unique value of the target column (here column #1)? Thanks.
So you only have the first column? And would like to generate the second? I think a slightly different awk command could work
awk '{if (last==$1) {flag=0} else {last=$1; flag=1}; print $0,flag}' file.txt
Basically you just check if the first field matches the last one you've seen. Since it's sorted, you don't have to keep track of everything you've seen, only the last one to know if the value is different.
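The "remember only the previous value" idea, sketched in Python for comparison. The generator name `flag_first` is made up for this example; it relies on the input being pre-sorted, as stated in the question, and keeps a single variable of state no matter how many distinct values there are.

```python
def flag_first(values):
    """Yield (value, flag); flag is 1 on the first row of each run of equal values."""
    previous = object()  # sentinel that compares unequal to any real value
    for value in values:
        yield value, int(value != previous)
        previous = value

column = ["1", "1", "1", "2", "2", "2"]  # first column from the question
flags = list(flag_first(column))
for value, flag in flags:
    print(value, flag)
```

Using a sentinel instead of an empty initial value avoids misflagging a first row whose key happens to be empty or zero.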
Seems like grep would be fine for this:
$ grep " 1" dataset

Output particular sections of txt file to csv?

Say I have a txt file like below (it's obviously not 'text text text', I'm just showing that it's blocks of irrelevant text)
text text text
text text text
text text text
important section age=30
name=mike
text text text
text text text
text text text
I want to parse it and output only the 'important section' to csv so that my csv would look like below, i.e. age in one column and name in another
age name
30 mike
How should I go about this? Perl? Sed? I'm not that familiar with either but hoping there is a straightforward enough solution.
Choroba actually answered the above perfectly for me but I fear I oversimplified my actual text file too much, it is more like below
Something:
this
Something else:
that
Something else:
etc.
Sales
2011 Sales:
€3,000
()
2010 Sales:
€2,000
()
2011 Growth Rate:
50.00%
Contact Details
And the output I would ideally like is
2011 Sales 2010 Sales 2011 Growth Rate
3,000 2,000 50.00%
This, unfortunately, greatly complicates things. The output doesn't have to be exactly like above but as close as possible
Perl solution. It keeps a flag telling whether we are in the important section. Everything important is remembered in an array and printed at the end:
perl -nE '$i = 1 if s/important section //;
push @t, [$1, $2] if $i and /(.*)=(.*)/;
}{
for my $i (0, 1) {
say join "\t", map $_->[$i], @t
}' file.txt
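The flag-and-collect logic of that one-liner may be easier to follow spelled out. Below is a hedged Python sketch of the same approach for the first, simplified input only (not the later Sales layout); the function name `extract_pairs` and the `marker` parameter are illustrative.

```python
import re

def extract_pairs(lines, marker="important section "):
    """Collect (key, value) pairs from key=value lines once the marker is seen."""
    pairs = []
    in_section = False
    for raw in lines:
        line = raw.rstrip("\n")
        if marker in line:
            in_section = True                   # same role as the $i flag
            line = line.replace(marker, "", 1)  # keep what follows the marker
        if in_section:
            match = re.fullmatch(r"(.*)=(.*)", line)
            if match:
                pairs.append((match.group(1), match.group(2)))
    return pairs

sample = [
    "text text text\n",
    "important section age=30\n",
    "name=mike\n",
    "text text text\n",
]
pairs = extract_pairs(sample)
print("\t".join(key for key, _ in pairs))      # header row
print("\t".join(value for _, value in pairs))  # value row
```

Like the Perl version, it never switches the flag off again, so any later key=value line in the file would also be picked up.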

print column and count the string

I have a large, space-delimited, column-wise text file:
Name subject Result
John maths pass
John science fail
John history pass
John geography pass
Jack maths pass
jack history fail
kelly science pass
kelly history pass
For each name, I want to count how many subjects were passed (it is a long name list, and each name should appear only once in the output). For example, John passed 3 and similarly Jack passed 1. It should print the result as
Name Passcount
John 3
Jack 1
Kelly 2
Can anybody help with an awk or perl script? Thanks in advance.
You can try something like this -
awk '
BEGIN{ print "Name\tPasscount"}
NR>1{if ($3=="pass") a[$1]++}
END{ for (x in a) print x"\t"a[x]}' file
Test:
$ cat file
Name subject Result
John maths pass
John science fail
John history pass
John geography pass
Jack maths pass
jack history fail
kelly science pass
kelly history pass
$ awk 'BEGIN{ print "Name\tPasscount"} NR>1{if ($3=="pass") a[$1]++}END{ for (x in a) print x"\t"a[x]}' file
Name Passcount
Jack 1
kelly 2
John 3
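The same counting logic as a Python sketch, in case the arbitrary order of awk's for (x in a) loop is a problem: a plain dict keeps names in first-seen order. The function name `count_passes` is made up for this example, and, like the awk version, it treats names case-sensitively and omits anyone with no passes.

```python
def count_passes(lines):
    """Count rows whose third column is 'pass', keyed by the first column."""
    counts = {}
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 3:
            continue  # ignore blank or malformed rows
        name, result = fields[0], fields[2]
        if result == "pass":
            counts[name] = counts.get(name, 0) + 1
    return counts

sample = [
    "Name subject Result",
    "John maths pass",
    "John science fail",
    "John history pass",
    "John geography pass",
    "Jack maths pass",
    "jack history fail",
    "kelly science pass",
    "kelly history pass",
]
counts = count_passes(sample)
print("Name\tPasscount")
for name, passed in counts.items():
    print(f"{name}\t{passed}")
```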