I have a problem. A file that was supposed to be tab-delimited is missing a few newlines. Right now my file looks something like this:
Field1 Field2 Field3
Field1 Field2 Field3 Field1 Field2 Field3 Field1 Field2 Field3
Field1 Field2 Field3 Field1 Field2 Field3
Field1 Field2 Field3
Field1 Field2 Field3 Field1 Field2 Field3
Field1 Field2 Field3
I want to make it uniform, with each "Field1" starting on a new line:
Field1 Field2 Field3
Field1 Field2 Field3
Field1 Field2 Field3
Field1 Field2 Field3
Field1 Field2 Field3
The problem is that each of these columns has a unique set of data, so I can't find a consistent place at which to split it onto a new line. Any help is greatly appreciated!
PS: doing this in sed or tr would be greatly appreciated
PS: there can be up to 150 columns, not just 6 or 9 or any other multiple of 3
This might work for you:
sed 's/\s/\n/3;P;D' file
Explanation:
The third whitespace character (space or tab) is replaced by a newline: s/\s/\n/3
The string up to the first newline is printed: P
The string up to the first newline is deleted: D
The D command has a split personality. If there is no newline, it deletes the pattern space and reads in the next line. If, however, a newline exists, it deletes the string up to and including the newline and restarts the cycle on what remains, without reading a new line, until no newlines are left.
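For example, assuming GNU sed (needed for \s and \n in the s command), a single line holding two tab-separated records gets split like this:

$ printf 'Field1\tField2\tField3\tField1\tField2\tField3\n' | sed 's/\s/\n/3;P;D'
Field1  Field2  Field3
Field1  Field2  Field3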
This will work on the example you gave...
sed -e 's/\([^\t ]* [^\t ]* [^\t ]*\)[\t ]/\1\n/g'
field1 field2 field3
name1 surname1 address1
name1 surname1 address1
name1 surname1 address1
name2 surname2 address2
name2 surname2 address2
name2 surname2 address2
...
In my select activity, the data preview returns several fields.
There are duplicates and I would like to return distinct rows.
After the select activity I have placed an aggregate activity.
Inside this aggregate activity I have the settings shown in the screenshot below.
How is this done please?
You can use the column pattern in the aggregate transformation to remove duplicate rows from the source.
Source:
Aggregate transformation:
Column that matches: name != 'field1' && name != 'field2' && name != 'field3'
Aggregate output:
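Since the aggregate settings themselves only appear in the screenshots, here is a sketch of the usual de-duplication setup (the group-by columns and the first($$) expression are assumptions on my part; only the column pattern above is given):

Group by: field1, field2, field3
Aggregates (column pattern): name != 'field1' && name != 'field2' && name != 'field3'
Aggregate expression: first($$)

first($$) keeps the first value of each matched column within a group, so every distinct combination of the grouped columns comes out exactly once.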
I want to count how many documents do not have a date in field1, field2, field3, and field4. I have created the query below, but it does not really look good.
select
count(doc)
where true
and field1 is not null
and field2 is not null
and field3 is not null
and field4 is not null
How can I apply one filter for multiple columns?
Thanks in advance.
There is nothing at all wrong with your current query, and it is probably what I would be using here. However, you could use a COALESCE trick:
SELECT COUNT(*)
FROM yourTable
WHERE COALESCE(field1, field2, field3, field4) IS NOT NULL;
This works because COALESCE returns the first non NULL value among its arguments. Any record having at least one of the four fields set to a non NULL date therefore produces a non NULL result and passes the IS NOT NULL check. Only records for which all four fields are NULL are filtered out.
Note that this counts records having at least one non NULL field. If instead you want to count records where all four fields are NULL, then use:
SELECT COUNT(*)
FROM yourTable
WHERE COALESCE(field1, field2, field3, field4) IS NULL;
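To make the difference concrete, here is a hypothetical set of rows (made up for illustration, not from the question):

id   field1       field2       field3   field4
1    2023-01-01   NULL         NULL     NULL
2    NULL         2023-02-01   NULL     2023-02-05
3    NULL         NULL         NULL     NULL

The IS NOT NULL version counts rows 1 and 2 (at least one date present), the IS NULL version counts only row 3, and your original AND-chained query counts none of them, since it requires all four fields to be filled.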
I have a table like this (for example):
Field1  Field2  Field3  Field4  .....
1       a       c       c
1       a       x       c
1       a       c       c
2       a       y       j
2       b       y       k
2       b       y       l
I need to select rows where one field has one particular value and compare all the fields across the selected rows, something like SELECT * WHERE Field1=1 ..... COMPARE.
I would like to have a result like:
Field1  Field2  Field3  Field4  .....
true    true    false   true
This should work for fixed columns and if there are no NULL values:
SELECT
COUNT(DISTINCT t.col1) = 1,
COUNT(DISTINCT t.col2) = 1,
COUNT(DISTINCT t.col3) = 1,
...
FROM mytable t
WHERE t.filter_column = 'some_value'
GROUP BY t.filter_column;
If you have some nullable columns, perhaps you could give it a try with something like this instead of the COUNT(DISTINCT t.<colname>) = 1:
BOOL_AND(NOT EXISTS(
SELECT 1
FROM mytable t2
WHERE t2.filter_column = 'some_value'
AND t2.<colname> IS DISTINCT FROM t.<colname>
))
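Plugged into a full query, that variant might look like this (only a sketch, assuming PostgreSQL for BOOL_AND and IS DISTINCT FROM, and reusing the placeholder names mytable, filter_column, col1..col3):

SELECT
  BOOL_AND(NOT EXISTS(SELECT 1 FROM mytable t2
                      WHERE t2.filter_column = 'some_value'
                        AND t2.col1 IS DISTINCT FROM t.col1)) AS col1_constant,
  BOOL_AND(NOT EXISTS(SELECT 1 FROM mytable t2
                      WHERE t2.filter_column = 'some_value'
                        AND t2.col2 IS DISTINCT FROM t.col2)) AS col2_constant,
  BOOL_AND(NOT EXISTS(SELECT 1 FROM mytable t2
                      WHERE t2.filter_column = 'some_value'
                        AND t2.col3 IS DISTINCT FROM t.col3)) AS col3_constant
FROM mytable t
WHERE t.filter_column = 'some_value';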
If you do not have fixed columns, you could build up a dynamic query in a function that takes the table name, the name of the filter column and the filter value as parameters.
Another remark: if you remove the filter (the condition t.filter_column = 'some_value'), group by t.filter_column and add it as an output column, you should be able to receive the result of this query for all distinct values in your filter column at once; see the sketch below.
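A minimal sketch of that unfiltered variant (same placeholder names as above):

SELECT
  t.filter_column,
  COUNT(DISTINCT t.col1) = 1 AS col1_constant,
  COUNT(DISTINCT t.col2) = 1 AS col2_constant,
  COUNT(DISTINCT t.col3) = 1 AS col3_constant
FROM mytable t
GROUP BY t.filter_column;

Each output row then tells you, for one value of the filter column, whether every other column is constant across the matching rows (again assuming no NULL values).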
I have a table structured as below:
FIELD1 FIELD2 FIELD3 FIELD4
ID001 AB 1 R
ID001 CD 2 R
ID002 AB 1 R
ID002 CD 3 R
ID002 EF 4 R
ID003 AB 1 R
ID003 CD 2 R
ID003 PQ 4 R
ID004 PQ 1 R
ID004 RS 2 R
The input I am getting from the other source is like this:
Field2, field3 and field4 will be the input. Field2 and field3 will be sent in combination. Field4 will be sent once.
Input 1-((AB,1,CD,2),R)
Input 2-((AB,1,CD,2,PQ,4),R)
For this I should get field1 as the output.
For input 1, it should return ID001
For input 2, it should return ID003.
Can anybody help me out with this?
The whole requirement is to get field1 from the other fields.
This works using the XML aggregation capabilities of DB2, with the input parameters passed as a concatenated filter string:
select field1 from
(
select
field1,
xmlcast(xmlgroup(field2 || field3 as a) as varchar(15)) as fields23,
field4
from
your_table
group by
field1, field4
)
where
fields23 = 'AB1CD2' and field4 = 'R';
For the "Input 2" case use this filter:
...
where
fields23 = 'AB1CD2PQ4' and field4 = 'R';
Based on this blog entry: https://www.ibm.com/developerworks/community/blogs/SQLTips4DB2LUW/entry/aggregating_strings42?lang=en
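For the sample table above, the inner subquery would produce rows along these lines (assuming the rows are concatenated in the order shown; without an ORDER BY inside XMLGROUP the concatenation order is not guaranteed):

FIELD1  FIELDS23   FIELD4
ID001   AB1CD2     R
ID002   AB1CD3EF4  R
ID003   AB1CD2PQ4  R
ID004   PQ1RS2     R

so the outer WHERE simply compares the concatenated input against FIELDS23.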
A system wraps lines in a log file if they exceed X characters. I am trying to extract various data from the log, but first I need to combine all the split lines so gawk can parse the fields as a single record.
For example:
2012/11/01 field1 field2 field3 field4 fi
eld5 field6 field7
2012/11/03 field1 field2 field3
2012/12/31 field1 field2 field3 field4 fi
eld5 field6 field7 field8 field9 field10
field11 field12 field13
2013/01/10 field1 field2 field3
2013/01/11 field1 field2 field3 field4
I want to return
2012/11/01 field1 field2 field3 field4 field5 field6 field7
2012/11/03 field1 field2 field3
2012/12/31 field1 field2 field3 field4 field5 field6 field7 field8 field9 field10 field11 field12 field13
2013/01/10 field1 field2 field3
2013/01/11 field1 field2 field3 field4
The actual maximum line length in my case is 130. I'm reluctant to test for that length and use getline to join the next line, in case there is an entry that is exactly 130 characters long.
Once I've cleaned up the log file, I'm also going to want to extract all the relevant events, where "relevant" may involve criteria like:
'foo' is anywhere in any field in the record
field2 ~ /bar|dtn/
if field1 ~ /xyz|abc/ && field98 == "0001"
I'm wondering if I will need to run two successive gawk programs, or if I can combine all of this into one.
I'm a gawk newbie and come from a non-Unix background.
$ awk '{printf "%s%s",($1 ~ "/" ? rs : ""),$0; rs=RS} END{print ""}' file
2012/11/01 field1 field2 field3 field4 field5 field6 field7
2012/11/03 field1 field2 field3
2012/12/31 field1 field2 field3 field4 field5 field6 field7 field8 field9 field10 field11 field12 field13
2013/01/10 field1 field2 field3
2013/01/11 field1 field2 field3 field4
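In case the one-liner is hard to read, here is the same logic written out with comments (just a restatement, not a different approach):

awk '
  {
    # a line whose first field contains "/" (the date) starts a new record,
    # so print the stored separator first (empty before the very first record)
    printf "%s%s", ($1 ~ "/" ? rs : ""), $0
    # after the first line the separator becomes a real newline
    rs = RS
  }
  END { print "" }   # terminate the final record with a newline
' file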
Now that I've noticed you don't actually want to just print recombined records, here's an alternative way to do it that's more amenable to running tests on the recombined record ("s" in this script):
$ awk 'NR>1 && $1~"/"{print s; s=""} {s=s $0} END{print s}' file
Now with that structure, instead of just printing s you can perform tests on s, for example (note "foo" in 3rd record):
$ cat file
2012/11/01 field1 field2 field3 field4 fi
eld5 field6 field7
2012/11/03 field1 field2 field3
2012/12/31 field1 field2 foo field4 fi
eld5 field6 field7 field8 field9 field10
field11 field12 field13
2013/01/10 field1 field2 field3
2013/01/11 field1 field2 field3 field4
$ awk '
function tst(rec, flds,nf,i) {
nf=split(rec,flds)
if (rec ~ "foo") {
print rec
for (i=1;i<=nf;i++)
print "\t",i,flds[i]
}
}
NR>1 && $1~"/" { tst(s); s="" }
{ s=s $0 }
END { tst(s) }
' file
2012/12/31 field1 field2 foo field4 field5 field6 field7 field8 field9 field10 field11 field12 field13
1 2012/12/31
2 field1
3 field2
4 foo
5 field4
6 field5
7 field6
8 field7
9 field8
10 field9
11 field10
12 field11
13 field12
14 field13
gawk '{ gsub( "\n", "" ); printf "%s", $0 RT }
END { print "" }' RS='\n[0-9][0-9][0-9][0-9]/[0-9][0-9]/[0-9][0-9]' input
This can be somewhat simplified with:
gawk --re-interval '{ gsub( "\n", "" ); printf "%s", $0 RT }
END { print "" }' RS='\n[0-9]{4}/[0-9]{2}/[0-9]{2}' input
This might work for you (GNU sed):
sed -r ':a;$!N;\#\n[0-9]{4}/[0-9]{2}/[0-9]{2}#!{s/\n//;ta};P;D' file
Here's a slightly bigger Perl solution which also handles the additional filtering (as you tagged this perl as well):
root@virtualdeb:~# cat combine_and_filter.pl
#!/usr/bin/perl -n
if (m!^2\d{3}/\d{2}/\d{2} !){
print $prevline if $prevline =~ m/field13/;
$prevline = $_;
}else{
chomp($prevline);
$prevline .= $_
}
# also test the final buffered record at EOF, otherwise the last entry is never checked
END { print $prevline if $prevline =~ m/field13/ }
root@virtualdeb:~# perl combine_and_filter.pl < /tmp/in.txt
2012/12/31 field1 field2 field3 field4 field5 field6 field7 field8 field9 field10 field11 field12 field13
This may work for you:
awk --re-interval '/^[0-9]{4}\//&&s{print s;s=""}{s=s $0}END{print s}' file
test with your example:
kent$ echo "2012/11/01 field1 field2 field3 field4 fi
eld5 field6 field7
2012/11/03 field1 field2 field3
2012/12/31 field1 field2 field3 field4 fi
eld5 field6 field7 field8 field9 field10
field11 field12 field13
2013/01/10 field1 field2 field3
2013/01/11 field1 field2 field3 field4"|awk --re-interval '/^[0-9]{4}\//&&s{print s;s=""}{s=s $0}END{print s}'
2012/11/01 field1 field2 field3 field4 field5 field6 field7
2012/11/03 field1 field2 field3
2012/12/31 field1 field2 field3 field4 field5 field6 field7 field8 field9 field10 field11 field12 field13
2013/01/10 field1 field2 field3
2013/01/11 field1 field2 field3 field4