Bulk insert csv - column value itself has commas - tsql

The Activity column contains values that have commas in the text.
Some records in the CSV:
Name,Project,Activity,Hrs
John,,,7.1
,Junkie,,7.1
,,Reviewing the,file,7.1   <-- this is under the 'Activity' column and has a comma in the text
When I use BULK INSERT, I get a 'bulk load data conversion' error for this line. If this line is removed, or the comma in that value is removed, it all works fine.
Please let me know what the options are in this case. I have many CSV files, and each might have many such values.

If I had this particular issue and the creation of the CSV files was not under my control, I would resort to a Perl script like this:
open(my $fhin,  "<", "MyFile.csv")  or die "Cannot open input: $!";
open(my $fhout, ">", "MyQFile.csv") or die "Cannot open output: $!";
while (my $line = <$fhin>) {
    chomp($line);
    # Quote all four columns; the greedy (.*) soaks up any commas in the third
    $line =~ s/^([^,]*),([^,]*),(.*),([^,]*)$/"$1","$2","$3","$4"/;
    print $fhout $line . "\n";
}
close($fhout);
Note that the above regular expression can handle only one "problem" column of this kind. If there are any others, there is no possibility of programmatically assigning correct quotation to such columns (without more information...).

I had a similar issue where a text string had a comma in it. I used the following field terminator to resolve it:
FIELDTERMINATOR = '\t'
This does not work on CSV, so I had to save my files as tab-delimited .txt.

Removing extra commas from csv file in perl

I have multiple CSV files, each with a different number of entries and roughly 300 lines each.
The first line in each file holds the data labels:
Person_id, person_name, person_email, person_address, person_recruitmentID, person_comments... etc
The rest of the lines in each file contain the data:
"0001", "bailey", "123 fake, street", "bailey@mail.com", "0001", "this guy doesnt know how to get rid of, commas!"... etc
I want to get rid of the commas that are in between quotation marks.
I'm currently going through the Text::CSV documentation, but it's a slow process.
A good CSV parser will have no trouble with this, since the commas are inside quoted fields, so you can simply parse the file with it.
A really nice module is Text::CSV_XS, which is loaded by default when you use the wrapper Text::CSV. The only thing to address in your data is the spaces between fields, since those aren't in the CSV spec; I use the option for that in the example below.
If you indeed must remove the commas for further work, do that as the parser hands you rows.
use warnings;
use strict;
use feature 'say';
use Text::CSV;

my $file = 'commas_in_fields.csv';

my $csv = Text::CSV->new( { binary => 1, allow_whitespace => 1 } )
    or die "Cannot use CSV: " . Text::CSV->error_diag();

open my $fh, '<', $file or die "Can't open $file: $!";

my @headers = @{ $csv->getline($fh) };    # if there is a separate header line

while (my $line = $csv->getline($fh)) {   # returns arrayref
    tr/,//d for @$line;                   # delete commas from each field
    say "@$line";
}
This uses tr on $_ in the for loop, thus changing the elements of the array iterated over themselves, for conciseness.
I'd like to repeat and emphasize what others have explained: do not parse CSV by hand, since trouble may await; use a library. This is akin to parsing XML and similar formats: no regex please, but libraries.
Let's get this out of the way: you cannot read a CSV by just splitting on commas. You've just demonstrated why; commas might be escaped or inside quotes. Those commas are totally valid, they're part of the data. Discarding them mangles the data in the CSV.
For this reason, and others, CSV files must be read using a CSV parsing library. To find which commas are data and which commas are structural also requires parsing the CSV using a CSV parsing library. So you won't be saving yourself any time by trying to remove the commas from inside quotes. Instead you'll give yourself more work while mangling the data. You'll have to use a CSV parsing library.
Text::CSV_XS is a very good, very fast CSV parsing library. It has a ton of features, most of which you do not need. Fortunately it has examples for doing most common actions.
For example, here's how you read and print each row from a file called file.csv.
use strict;
use warnings;
use autodie;
use v5.10;    # for `say`

use Text::CSV_XS;

# Open the file.
open my $fh, "<", "file.csv";

# Create a new Text::CSV_XS object.
# allow_whitespace allows there to be whitespace between the fields.
my $csv = Text::CSV_XS->new({
    allow_whitespace => 1
});

# Read in the header line so it's not counted as data.
# Then you can use $csv->getline_hr() to read each row in as a hash.
$csv->header($fh);

# Read each row.
while( my $row = $csv->getline($fh) ) {
    # Do whatever you want with the list of cells in $row.
    # This prints them separated by semicolons.
    say join "; ", @$row;
}

how to replace , to . in large text file

I wish to use Perl to write a program that looks for latitude and longitude values in a large tab-delimited text file (100,000 rows) and replaces the , used in the lat/long values with a . (dot). The file has multiple columns.
i.e. I want to change
51,2356 to 51.2356
can someone show me how this is done?
many thanks,
You don't need a "program" for that; things like this are really one-liner jobs. If you want to replace ALL , (commas) with . (dots) in the entire file (your question doesn't go into the specifics of the original file format), then the below does the trick:
perl -pi.bak -e 's/,/\./g;' your_file.txt
It will also back your file up to your_file.txt.bak before doing the replacement.
Quick and dirty way is to replace ALL commas by dots ( if that will serve your requirement)
while (<$fh>) {
    $_ =~ s/,/\./g;
}
However, if there are other fields which might be affected, better solution would be to replace only the desired columns.
Assuming the two fields you're interested in are the first and second columns of the tab-delimited file, the two fields should be matched and then have their commas replaced with dots:
while (<$fh>) {
    if ($_ =~ /^(\d+,\d+)\t(\d+,\d+)\t(.*)$/) {
        my ($lat, $lon, $rest) = ($1, $2, $3);
        $lat =~ s/,/\./g;
        $lon =~ s/,/\./g;
        $_ = join "\t", $lat, $lon, $rest;
    }
}
Here,
$lat and $lon match the latitude and longitude columns. (The separator in the pattern is a tab, to match the file format; $a and $b are avoided as variable names because Perl reserves them for sort.)

Remove CRLF end of csv file using Perl

I have a CSV file where each record ends with CRLF. Using Perl, how can I remove the CRLF from only the final record of the file, so that there is no empty record row at the end of the file? Thank you.
If I follow the question correctly, there is a line feed trailing the last record, which is creating an empty row at the end of the file.
You can read the file into a scalar and remove the trailing, blank row with a substitution. \R works on newer Perl versions (5.10 and later) and matches any system's line break; otherwise you'll need to use \n or \r\n explicitly.
open my $fh, '<', 'test.csv' or die "Cannot open test.csv: $!";
my $str = '';
while (<$fh>) {
    $str .= $_;
}
$str =~ s/\R+\z//;    # strip only the line break(s) at the very end

Counting records separated by CR/LF (carriage return and newline) in Perl

I'm trying to create a simple script to read a text file that contains records of book titles. Each record is separated by a blank line (i.e. \r\n\r\n). I need to count how many records are in the file.
For example here is the input file:
record 1
some text
record 2
some text
...
I'm using a regex to check for carriage return and newline, but it fails to match. What am I doing wrong? I'm at my wits' end.
sub readInputFile {
    my $inputFile = $_[0];    # read first argument from the command line as fileName
    open INPUTFILE, "+<", $inputFile or die $!;    # open file
    my $singleLine;
    my @singleRecord;
    my $recordCounter = 0;
    while (<INPUTFILE>) {    # loop through the input file line-by-line
        $singleLine = $_;
        push(@singleRecord, $singleLine);    # start adding each line to a record array
        if ($singleLine =~ m/\r\n/) {    # check for carriage return and new line
            $recordCounter += 1;
            createHashTable(@singleRecord);    # send record to make a hash table
            @singleRecord = ();    # empty the current record to start a new record
        }
    }
    print "total records : $recordCounter \n";
    close(INPUTFILE);
}
It sounds like you are processing a Windows text file on Linux, in which case you want to open the file with the :crlf layer, which will convert all CRLF line-endings to the standard Perl \n ending.
If you are reading Windows files on a Windows platform then the conversion is already done for you, and you won't find CRLF sequences in the data you have read. If you are reading a Linux file then there are no CR characters in there anyway.
It also sounds like your records are separated by a blank line. Setting the built-in input record separator variable $/ to a null string will cause Perl to read a whole record at a time.
I believe this version of your subroutine is what you need. Note that people familiar with Perl will thank you for using lower-case letters and underscore for variables and subroutine names. Mixed case is conventionally reserved for package names.
You don't show create_hash_table so I can't tell what data it needs. I have chomped and split the record into lines, and passed a list of the lines in the record with the newlines removed. It would probably be better to pass the entire record as a single string and leave create_hash_table to process it as required.
sub read_input_file {
    my ($input_file) = @_;
    open my $fh, '<:crlf', $input_file or die $!;
    local $/ = '';    # paragraph mode: read one blank-line-separated record at a time
    my $record_counter = 0;
    while (my $record = <$fh>) {
        chomp $record;
        ++$record_counter;
        create_hash_table(split /\n/, $record);
    }
    close $fh;
    print "Total records : $record_counter\n";
}
You can do this more succinctly by changing Perl's record-separator, which will make the loop return a record at a time instead of a line at a time.
E.g. after opening your file:
local $/ = "\r\n\r\n";
my $recordCounter = 0;
$recordCounter++ while(<INPUTFILE>);
$/ holds Perl's global record-separator, and scoping it with local allows you to override its value temporarily until the end of the enclosing block, when it will automatically revert back to its previous value.
But it sounds like the file you're processing may actually have "\n\n" record-separators, or even "\r\r". You'd need to set the record-separator correctly for whatever file you're processing.
If your files are not huge multi-gigabyte files, the easiest and safest way is to read the whole file and use the generic newline metacharacter \R.
This way, it also works if some file actually uses LF instead of CRLF (or even the old Mac standard, CR).
Use it with split if you also need the actual records:
perl -ln -0777 -e 'my @records = split /\R\R/; print scalar(@records)' $Your_File
Or if you only want to count the records:
perl -ln -0777 -e 'my $count = () = /\R\R/g; print $count' $Your_File
For more details, see also my other answer here to a similar question.

removing double quotes from a file

I have a tab delimited file, that looks like this.
"""chr1" "38045559" "38046059" "C1orf122"""
"""" "" "" "C1orf122"""
"""" "" "" "YRDC"""
"""chr1" "205291045" "205291545" "YOD1"""
"""chr1" "1499717" "1500625" "SSU72"""
I got this file after converting a .csv to a tab-separated file with this command:
perl -lpe 's/"/""/g; s/^|$/"/g; s/","/\t/g' < test.csv > test_tab
Now, I want my file to remain tab-separated, but all the extra quotes should be removed. At the same time, when I print column 4 I should get all the names, and for columns 1, 2, and 3 the coordinates (these I still get, but with quotes).
What manipulation shall I do to the above command to achieve this? Kindly guide.
The desired output is (since I was asked to be clear):
chr1 38045559 38046059 C1orf122
C1orf122
YRDC
chr1 205291045 205291545 YOD1
chr1 1499717 1500625 SSU72
so that when I extract Column 4 I should get
C1orf122
C1orf122
YRDC
YOD1
SSU72
Thank you
It appears that most of those quotes are being inserted by your command to bring in the file. Instead open the file normally:
use strict;
use warnings;

open CSV, 'test.csv' or die "can't open input file.";
open TAB, '>test.tab' or die "can't open output file.";

my @row_array;
while (<CSV>)
{
    # Remove any quotes that exist on the line (it is in the default variable $_).
    s/"//g;
    # Split the current row into an array.
    my @fields = split /,/;
    # Write the output, tab-delimited file.
    print TAB join("\t", @fields) . "\n";
    # Put the row into a multidimensional array.
    push @row_array, \@fields;
}

print "Column 4:\n";
print $_->[3] . "\n" foreach (@row_array);

print "\nColumns 1-3:\n";
print "@{$_}[0..2]\n" foreach (@row_array);
Any quotes that still do exist will be removed by s/"//g; in the above code. This will remove all quotes; it doesn't check whether they are at the beginning and end of a field. If you might have some quotes within the data that you need to preserve, you would need a more sophisticated matching pattern.
Update: I added code to create a tab-separated output file, since you seem to want that. I don't understand exactly what your requirement about getting "all the names...and the coordinates" is, but you should be able to use the above code for that: just add what you need inside the while loop. You can reference, for example, column 1 with $fields[0].
Update 2: Added code to extract column 4, then columns 1-3. The syntax for using multidimensional arrays is tricky. See perldsc and perlref for more information.
Update 3: Added code to remove the quotes that still exist in your file.