removing double quotes from a file - perl

I have a tab-delimited file that looks like this.
"""chr1" "38045559" "38046059" "C1orf122"""
"""" "" "" "C1orf122"""
"""" "" "" "YRDC"""
"""chr1" "205291045" "205291545" "YOD1"""
"""chr1" "1499717" "1500625" "SSU72"""
I got this file after converting a .csv to a tab-separated file with this command:
perl -lpe 's/"/""/g; s/^|$/"/g; s/","/\t/g' < test.csv > test_tab
Now I want my file to remain tab-separated, but with all the extra quotes removed. At the same time, when I print column 4 I should get all the names, and for columns 1, 2, and 3 the coordinates (these I still get, but with quotes).
What changes should I make to the above command to achieve this? Kindly guide me.
The desired output is (since I was asked to be clear):
chr1 38045559 38046059 C1orf122
C1orf122
YRDC
chr1 205291045 205291545 YOD1
chr1 1499717 1500625 SSU72
so that when I extract column 4 I get:
C1orf122
C1orf122
YRDC
YOD1
SSU72
Thank you

It appears that most of those quotes are being inserted by your command to bring in the file. Instead, open the file normally:
use strict;
use warnings;

open CSV, '<', 'test.csv' or die "Can't open input file: $!";
open TAB, '>', 'test.tab' or die "Can't open output file: $!";

my @row_array;
while (<CSV>)
{
    # Remove any quotes that exist on the line (it is in the default variable $_).
    s/"//g;
    # Split the current row into an array.
    my @fields = split /,/;
    # Write the output, tab-delimited file.
    print TAB join("\t", @fields) . "\n";
    # Put the row into a multidimensional array.
    push @row_array, \@fields;
}

print "Column 4:\n";
print $_->[3] . "\n" foreach (@row_array);

print "\nColumns 1-3:\n";
print "@{$_}[0..2]\n" foreach (@row_array);
Any quotes that still do exist will be removed by s/"//g; in the above code. This will remove all quotes; it doesn't check whether they are at the beginning and end of a field. If you might have some quotes within the data that you need to preserve, you would need a more sophisticated matching pattern.
Update: I added code to create a tab-separated output file, since you seem to want that. I don't understand exactly what your requirement about getting "all the names...and the coordinates" is, but you should be able to use the above code for that: just add what you need inside the while loop. You can reference, for example, column 1 with $fields[0].
Update 2: Added code to extract column 4, then columns 1-3. The syntax for using multidimensional arrays is tricky. See perldsc and perlref for more information.
Update 3: Added code to remove the quotes that still exist in your file.
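If some quotes belong in the data, the more sophisticated option mentioned above is to use a real CSV parser instead of a blanket s/"//g. A minimal sketch using the Text::CSV module (an assumption: the module is installed; file names as above):
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({ binary => 1 }) or die Text::CSV->error_diag;
open my $in,  '<', 'test.csv' or die "Can't open input file: $!";
open my $out, '>', 'test.tab' or die "Can't open output file: $!";
while (my $row = $csv->getline($in)) {
    # getline strips the enclosing quotes but preserves quotes inside fields
    print {$out} join("\t", @$row), "\n";
}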

Related

How to split a file (with sed) into numerous files according to a value found on each line?

I have several Company_***.csv files (although the separator is a tab, not a comma, so they should really be *.tsv, but never mind) which contain a header plus numerous data lines, e.g.:
1stHeader 2ndHeader DateHeader OtherHeaders...
111111111 SOME STRING 2020-08-01 OTHER STRINGS..
222222222 ANOT STRING 2020-08-02 OTHER STRINGS..
I have to split them according to the 3rd column here, it's a date.
Each file should be named like e.g. Company_2020_08_01.csv, Company_2020_08_02.csv and so on,
each containing the same header on the 1st line plus the matching rows as the following lines.
At first I thought about saving (once) the header in a single file e.g.
sed -n '1w Company_header.csv' Company_*.csv
then parsing the files with a pattern for the date (hence the headers would be skipped) e.g.
sed -n '/\t2020-[01][0-9]-[0-3][0-9]\t/w somefilename.csv' Company_*.csv
... and at last, insert the (missing) header in each generated file.
But I'm stuck at step 2: I can't find how to generate (dynamically) the filename expected by the w command, nor how to capture the date in the search pattern (apparently this is just an address, not a search-and-replace "field" as in the s/regexp/replacement/[flags] command, so you can't have capturing groups ( ) in there).
So I wonder if this is actually doable with sed? Or should I look upon other tools e.g. awk?
Disclaimer: I'm quite a n00b with these commands so I'm just learning/starting from scratch...
Perl to the rescue!
perl -e 'while (<>) {
    $h = $_, next if $. == 1;
    $. = 0 if eof;
    @c = split /\t/;
    open my $out, ">>", "Company_" . $c[2] =~ tr/-/_/r . ".csv" or die $!;
    print {$out} $h unless tell $out;
    print {$out} $_;
}' -- Company_*.csv
The diamond operator <> in scalar context reads a line from the input.
The first line of each file is stored in the variable $h (see $. and eof).
split populates the @c array with the column values of each line.
$c[2] contains the date; using tr we translate dashes to underscores to create a filename from it. open opens the file for appending.
print prints the header if the file is empty (see tell)
and prints the current line, too.
Note that it only appends to the files, so don't forget to delete any output files before running the script again.
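If the one-liner is hard to follow, here is the same logic written out as a commented script (a sketch; run it as perl split_by_date.pl Company_*.csv, where the script name is hypothetical):
#!/usr/bin/perl
# Same splitting logic as the one-liner above, written out for readability.
use strict;
use warnings;

my $header;
while (<>) {
    $header = $_, next if $. == 1;      # remember each file's header line
    $. = 0 if eof;                      # reset $. so the next file's first line counts as line 1
    my @c = split /\t/;
    (my $date = $c[2]) =~ tr/-/_/;      # 2020-08-01 becomes 2020_08_01
    open my $out, '>>', "Company_$date.csv" or die $!;
    print {$out} $header unless tell $out;  # header only if the file is still empty
    print {$out} $_;                    # then the current data line
}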

Bulk insert csv - column value itself has commas

The Activity column has values that have commas in the text
Some records in the csv:
Name,Project,Activity,Hrs
John,,,7.1
,Junkie,,7.1
,,Reviewing the,file,7.1 //This is under 'Activity' column and it has a comma in the text
When I use BULK INSERT, I get a 'bulk load data conversion' error for this line. If this line is removed, or the comma in that sentence is removed, it all works fine.
Please let me know what are the options in this case. I have many csv files, and each might have many such values.
If I had this particular issue and the creation of the CSV files was not under my control, I would resort to a Perl script like this:
open(my $fhin,  "<", "MyFile.csv")  or die $!;
open(my $fhout, ">", "MyQFile.csv") or die $!;
while (my $line = <$fhin>) {
    chomp($line);
    $line =~ s/^([^,]*),([^,]*),(.*),([^,]*)$/"$1","$2","$3","$4"/;
    print $fhout $line . "\n";
}
Note that the above regular expression can handle only one "problem" column of this kind. If there are any others, there is no possibility of programmatically assigning correct quotation to such columns (without more information...).
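If you'd rather not write a script file, the same repair works as a one-liner (a sketch; file names are placeholders):
perl -pe 's/^([^,]*),([^,]*),(.*),([^,]*)$/"$1","$2","$3","$4"/' MyFile.csv > MyQFile.csv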
I had a similar issue where a text string had a comma in it. I used the following field terminator to resolve it.
FIELDTERMINATOR = '\t'
This does not work on CSV and I had to save my files as .txt.

Perl: string formatting in tab delimited file

I have no background in programming whatsoever, so I would appreciate it if you would explain how and why any code you recommend should be written the way it is.
I have a data matrix of 2,000+ samples, and I need to manipulate the format of one of its columns.
The column in question is the sample number (column 16). Its format is currently similar to ABCD-A1-A0SD-01A-11D-A10Y-09, and I would like to change it to ABCD-A1-A0SD-01A. This will put it in the right format to merge with my other matrix. I can't seem to find any information on how to proceed with this step.
The sample input should look like this:
ABCD-A1-A0SD-01A-11D-A10Y-09
ABCD-A1-A0SD-01A-11D-A10Y-09
ABCD-A1-A0SE-01A-11D-A10Y-09
ABCD-A1-A0SE-01A-11D-A10Y-09
ABCD-A1-A0SF-01A-11D-A10Y-09
ABCD-A1-A0SH-01A-11D-A10Y-09
ABCD-A1-A0SI-01A-11D-A10Y-09
I want the last three extensions removed. The output sample should look like this:
ABCD-A1-A0SD-01A
ABCD-A1-A0SD-01A
ABCD-A1-A0SE-01A
ABCD-A1-A0SE-01A
ABCD-A1-A0SF-01A
ABCD-A1-A0SH-01A
ABCD-A1-A0SI-01A
Finally, the matrix that I want to merge with has a different layout; the numbers of columns and rows are different. This is an issue for the next step, which is merging the two matrices. The original matrix has about 52 columns and 2,000+ rows, whereas the merging matrix has only 15 columns and 467 rows.
Each row of the original matrix has mutational information for a patient. This means that the same patient with the same ID might appear many times. The second matrix contains the patient information, so no patients are repeated in that matrix. When merging the matrix, I want to make sure that every patient mutation (each row) is matched with its corresponding information from the merging matrix.
My sample code:
#!/usr/bin/perl
use strict;
use warnings;
my $file = 'sorted_samples_2.txt';
open(INFILE, $file) or die "Can't open file: $!\n";
open(my $outfile, '>', 'sorted_samples_changed.txt');
foreach my $line (<INFILE>) {
    print "The input line is $line\n";
    my @columns = split('\t', $line);
    ($columns[15]) = $columns[15] =~ /:((\w\w\w\w-\w\d-\w|\w\w-\d\d\w)+)$/;
    printf $outfile "@columns/n";
}
Issues: the code deletes the header and wipes out the string in column 16.
A few issues about your code:
Good job including use strict; and use warnings;. Keep doing that.
Anytime you're doing file or directory processing, include use autodie; as well.
Always use lexical file handles $infh instead of globs INFILE.
Use the 3 parameter form of open.
Always process a file line by line using a while loop. A foreach loop loads the entire file into memory first.
Don't forget to chomp your input from a file.
Use the line number variable $. if you want special logic for your header
The first parameter of split is a pattern, so use /\t/. The only exception to this is ' ', which has special meaning. Currently you're introducing a bug by using a single-quoted string.
When altering a value with a regex, try to focus on what you DO want instead of what you DON'T. In this case it looks like you want 4 groups separated by dashes, and then truncate the rest. Focus on matching those groups.
Don't use printf when you mean print.
The following applies these fixes to your script:
#!/usr/bin/perl
use strict;
use warnings;
use autodie;

my $infile  = 'sorted_samples_2.txt';
my $outfile = 'sorted_samples_changed.txt';

open my $infh,  '<', $infile;
open my $outfh, '>', $outfile;

while (my $line = <$infh>) {
    chomp $line;
    my @columns = split /\t/, $line;
    if ($. > 1) {
        $columns[15] =~ s/^(\w{4}-\w\d-\w{4}-\w{3}).*/$1/
            or warn "Unable to fix column at line $.";
    }
    print $outfh join("\t", @columns), "\n";
}
You need to declare the scope of your variables with my when you use use strict.
In your case, you should use my @sort = sort {....} in the first line, and
you should have an array reference $t defined somewhere to dereference it in the second line. You don't have @array declared anywhere in this code, which is why you got all those errors. Make sure you understand what you are doing before you do it.

Counting records separated by CR/LF (carriage return and newline) in Perl

I'm trying to create a simple script to read a text file that contains records of book titles. Each record is separated by a plain old blank line (\r\n\r\n). I need to count how many records are in the file.
For example here is the input file:
record 1
some text
record 2
some text
...
I'm using a regex to check for carriage return and newline, but it fails to match. What am I doing wrong? I'm at my wits' end.
sub readInputFile {
    my $inputFile = $_[0];              # read first argument from the command line as fileName
    open INPUTFILE, "+<", $inputFile or die $!;  # open file
    my $singleLine;
    my @singleRecord;
    my $recordCounter = 0;
    while (<INPUTFILE>) {               # loop through the input file line by line
        $singleLine = $_;
        push(@singleRecord, $singleLine);   # start adding each line to a record array
        if ($singleLine =~ m/\r\n/) {       # check for carriage return and newline
            $recordCounter += 1;
            createHashTable(@singleRecord); # send record to make a hash table
            @singleRecord = ();             # empty the current record to start a new record
        }
    }
    print "total records : $recordCounter \n";
    close(INPUTFILE);
}
It sounds like you are processing a Windows text file on Linux, in which case you want to open the file with the :crlf layer, which will convert all CRLF line-endings to the standard Perl \n ending.
If you are reading Windows files on a Windows platform then the conversion is already done for you, and you won't find CRLF sequences in the data you have read. If you are reading a Linux file then there are no CR characters in there anyway.
It also sounds like your records are separated by a blank line. Setting the built-in input record separator variable $/ to a null string will cause Perl to read a whole record at a time.
I believe this version of your subroutine is what you need. Note that people familiar with Perl will thank you for using lower-case letters and underscore for variables and subroutine names. Mixed case is conventionally reserved for package names.
You don't show create_hash_table so I can't tell what data it needs. I have chomped and split the record into lines, and passed a list of the lines in the record with the newlines removed. It would probably be better to pass the entire record as a single string and leave create_hash_table to process it as required.
sub read_input_file {
    my ($input_file) = @_;
    open my $fh, '<:crlf', $input_file or die $!;
    local $/ = '';    # paragraph mode: read one blank-line-separated record at a time
    my $record_counter = 0;
    while (my $record = <$fh>) {
        chomp $record;
        ++$record_counter;
        create_hash_table(split /\n/, $record);
    }
    close $fh;
    print "Total records : $record_counter\n";
}
You can do this more succinctly by changing Perl's record-separator, which will make the loop return a record at a time instead of a line at a time.
E.g. after opening your file:
local $/ = "\r\n\r\n";
my $recordCounter = 0;
$recordCounter++ while(<INPUTFILE>);
$/ holds Perl's global record-separator, and scoping it with local allows you to override its value temporarily until the end of the enclosing block, when it will automatically revert back to its previous value.
But it sounds like the file you're processing may actually have "\n\n" record-separators, or even "\r\r". You'd need to set the record-separator correctly for whatever file you're processing.
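If you are not sure which separators the file actually contains, a quick count can settle it (a sketch; the file name is a placeholder):
perl -0777 -ne 'my $crlf = () = /\r\n/g; printf "CR: %d LF: %d CRLF: %d\n", tr/\r//, tr/\n//, $crlf' books.txt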
If your files are not huge multi-gigabytes files, the easiest and safest way is to read the whole file, and use the generic newline metacharacter \R.
This way, it also works if some file actually uses LF instead of CRLF (or even the old Mac standard CR).
Use it with split if you also need the actual records:
perl -ln -0777 -e 'my @records = split /\R\R/; print scalar(@records)' $Your_File
Or if you only want to count the records:
perl -ln -0777 -e 'my $count=()=/\R\R/g; print $count' $Your_File
For more details, see also my other answer here to a similar question.

Extracting unique values from multiple files in Perl

I have several data files that are tab delimited. I need to extract all the unique values in a certain column of these data files (say column 25) and write these values into an output file for further processing. How might I do this in Perl? Remember I need to consider multiple files in the same folder.
edit: The code I have so far is like this.
#!/usr/bin/perl
use warnings;
use strict;
my @hhfilelist = glob "*.hh3";
for my $f (@hhfilelist) {
    open F, '<', $f or die "Cannot open $f: $!";
    while (<F>) {
        chomp;
        my @line = split /\t/;
        print "field is $line[24]\n";
    }
    close(F);
}
The question is how do I efficiently create the hash/array of unique values as I read each line of each file. Or is it faster if I populate the whole array and then remove duplicates?
Some tips on how to handle the problem:
Find files
For finding files within a directory, use glob: glob '.* *'
For finding files within a directory tree, use File::Find's find function
Open each file, use Text::CSV with \t character as the delimiter, extract wanted values and write to file
For a Perl solution, use the Text::CSV module to parse flat (X-separated) files; the constructor accepts a parameter specifying the separator character. Do this for every file in a loop, with the file list generated either by glob() for files in a given directory or by File::Find for subdirectories as well.
Then, to get the unique values, for each row store column 25 in a hash.
E.g. after retrieving the values:
$colref = $csv->getline($io);
$unique_values_hash{ $colref->[24] } = 1;
Then, iterate over hash keys and print to a file.
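Putting those pieces together, a minimal sketch (assuming Text::CSV is installed; the *.hh3 glob and the output file name unique_values.txt are placeholders):
use strict;
use warnings;
use Text::CSV;

# Tab-separated parsing, as described above.
my $csv = Text::CSV->new({ sep_char => "\t", binary => 1 });
my %unique_values_hash;
for my $file (glob '*.hh3') {
    open my $io, '<', $file or die "Cannot open $file: $!";
    while (my $colref = $csv->getline($io)) {
        $unique_values_hash{ $colref->[24] } = 1;   # column 25, zero-based index 24
    }
    close $io;
}
open my $out, '>', 'unique_values.txt' or die $!;
print {$out} "$_\n" for sort keys %unique_values_hash;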
For non-Perl shell solution, you can simply do:
cat MyFile_pattern | awk -F'\t' '{print $25}' | sort -u > MyUniqueValuesFile
You can replace awk with cut; see the example below.
Please note that the non-Perl solution only works if the files don't contain tabs in the fields themselves and the columns aren't quoted.
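For example, with cut (a sketch; tab is cut's default delimiter, and the file names are the same placeholders):
cut -f25 MyFile_pattern | sort -u > MyUniqueValuesFile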
perl -F/\\t/ -ane 'print"$F[24]\n" unless $seen{$F[24]}++' inputs > output
perl -F/\\t/ -ane 'print"$F[24]\n" unless $seen{$F[24]}++' *.hh3 > output
Command-line switches -F/\\t/ -an mean: iterate through every line in every input file and split the line on the tab character into the array @F.
$F[24] refers to the value in the 25-th field of each line (between the 24-th and 25-th tab characters)
$seen{...} is a hashtable to keep track of which values have already been observed.
The first time a value is observed, $seen{VALUE} is 0 so Perl will execute the statement print"$F[24]\n". Every other time the value is observed, $seen{VALUE} will be non-zero and the statement won't be executed. This way each unique value gets printed out exactly once.
In a similar context to your larger script:
my @hhfilelist = glob "*.hh3";
my %values_in_field_25 = ();
for my $f (@hhfilelist) {
    open F, '<', $f or die "Cannot open $f: $!";
    while (<F>) {
        my @F = split /\t/;
        $values_in_field_25{$F[24]} = 1;
    }
    close(F);
}
my @unique_values_in_field_25 = keys %values_in_field_25; # or sort keys ...
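To actually write those values to a file for further processing, as the question asks, you can add something like this after the loop (the output file name is a placeholder):
open my $out, '>', 'unique_field25.txt' or die "Cannot open output: $!";
print {$out} "$_\n" for sort keys %values_in_field_25;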