I am using gcloud beta logging read to read some logs and am using the --format option to format as csv:
--format="csv(timestamp,jsonPayload.message)"
which works fine.
gcloud topic formats suggests I can specify the separator for my CSV output (I'd like to specify ", " so that the entries are spaced out a little), but I can't figure out the syntax for specifying the separator. I've tried the following, but neither is correct:
--format="csv(timestamp,jsonPayload.message),separator=', '"
--format="csv(timestamp,jsonPayload.message)" --separator=", "
Does anyone know how to do this?
thx
Never mind, I figured it out.
--format="csv[separator=', '](timestamp,jsonPayload.message)"
I tried to read a CSV file with pyspark with the following line in it:
2100,"Apple Mac Air A1465 11.6"" Laptop - MD/B (Apr, 2014)",Apple MacBook
My code for reading:
df = spark.read.options(header='true', inferschema='true').csv(file_path)
And the df splits the second component in the middle:
first component: 2100
second component: "Apple Mac Air A1465 11.6"" Laptop - MD/B (Apr,
Third component: 2014)"
Meaning that the second original component was split into two components.
I tried several other syntaxes (Databricks CSV reader, SQLContext, etc.) but all had the same result.
What is the reason for that? How could I fix it?
For this type of scenario, Spark provides a great solution: the escape option.
Just add escape='"' to the options and you will get the 3 components shown below.
df= spark.read.options(header='true', inferschema='true',escape='"').csv("file:///home/srikarthik/av.txt")
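With escape='"' the sample line parses into the expected three components (the doubled "" is read as a literal quote):
first component: 2100
second component: Apple Mac Air A1465 11.6" Laptop - MD/B (Apr, 2014)
third component: Apple MacBook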
This is happening because the field separator is a comma (',').
So you need code that ignores a comma when it comes between " and ".
Otherwise, a second solution: read the file as it is, without the column header; replace any comma that comes between " and " with '*' or any other punctuation; save the file; and then read it using comma as the separator. It will work. A rough sketch of that workaround follows below.
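This is only a rough sketch of that preprocessing idea, not code from the answer: it assumes plain Python does the rewrite step, that the file is small enough to rewrite locally, and that the input/output paths and the ';' replacement character are placeholders to adapt. The escape='"' option above remains the cleaner fix.

import re

def mask_commas_in_quotes(line):
    # replace commas inside double-quoted segments so the default
    # comma separator no longer splits those fields
    return re.sub(r'"[^"]*"', lambda m: m.group(0).replace(',', ';'), line)

with open('/path/to/input.csv') as src, open('/path/to/cleaned.csv', 'w') as dst:
    for line in src:
        dst.write(mask_commas_in_quotes(line))

# read the cleaned file with the default comma separator
df = spark.read.options(header='true', inferschema='true').csv('file:///path/to/cleaned.csv')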
I am facing an issue when reading and parsing a CSV file. Some records have a newline symbol, "escaped" by a \, and those records are not quoted. The file might look like this:
Line1field1;Line1field2.1 \
Line1field2.2;Line1field3;
Line2FIeld1;Line2field2;Line2field3;
I've tried to read it using sc.textFile("file.csv") and using sqlContext.read.format("..databricks..").option("escape/delimiter/...").load("file.csv")
However, no matter how I read it, a record/line/row is created when "\ \n" is reached. So, instead of having 2 records from the previous file, I am getting three:
[Line1field1,Line1field2.1,null] (3 fields)
[Line1field2.2,Line1field3,null] (3 fields)
[Line2FIeld1,Line2field2,Line2field3;] (3 fields)
The expected result is:
[Line1field1,Line1field2.1 Line1field2.2,Line1field3] (3 fields)
[Line2FIeld1,Line2field2,Line2field3] (3 fields)
(How the newline symbol is saved in the record is not that important; the main issue is having the correct set of records/lines.)
Any ideas of how to do that, without modifying the original file and preferably without any post/re-processing? (For example, reading the file, filtering the lines with fewer fields than expected and then concatenating them could be a solution, but it is not at all optimal.)
My hope was to use Databricks' CSV parser to set the escape character to \ (which is supposed to be the default), but that didn't work [I got an error saying java.io.IOException: EOF whilst processing escape sequence].
Should I somehow extend the parser and edit something, creating my own parser? Which would be the best solution?
Thanks!
EDIT: Forgot to mention, I'm using Spark 1.6.
The wholeTextFiles API should come to the rescue in your case. It reads files as key-value pairs: the key is the path of the file and the value is the whole text of the file. You will have to do some replacements and splitting to get the desired output, though.
// wholeTextFiles reads each file as a (path, wholeFileText) pair (on Spark 1.6 use sc.wholeTextFiles)
val rdd = sparkSession.sparkContext.wholeTextFiles("path to the file")
  // join the backslash-escaped line breaks, drop the trailing ";" before each real newline, then split into records
  .flatMap(x => x._2.replace("\\\n", "").replace(";\n", "\n").split("\n"))
  .map(x => x.split(";"))  // split each record into its fields
The rdd output is:
[Line1field1,Line1field2.1 Line1field2.2,Line1field3]
[Line2FIeld1,Line2field2,Line2field3]
I have been trying to import several CSV files into MongoDB using the mongoimport tool. The thing is that, despite what the name says, in several countries CSV files are saved with semicolons instead of commas, making me unable to use the mongoimport tool properly.
There are some workarounds for this by changing the delimiter option in the region settings; however, for several reasons I don't have access to the machine that generates these CSV files, so I can't do that.
I was wondering: is there any way to import these CSV files using the mongo tools, instead of me having to write something to replace all the semicolons in a file with commas? I find it pretty strange that mongo overlooks the fact that in some countries semicolons are used.
MongoDB supports TSV, so we should replace ";" with "\t":
tr ";" "\t" < file.csv | mongoimport --type tsv ...
It looks like this is not supported; I can't find an option to specify a delimiter among the allowed arguments for mongoimport on the documentation page http://docs.mongodb.org/manual/reference/program/mongoimport/#bin.mongoimport.
You can file a feature request on JIRA if it's something you'd like to see supported.
I have been searching, but so far I have only found how to insert data into tables based on CSV files.
I have the following scenario:
Directory name = ticketID
Inside this directory I have a couple of files, like:
Description.txt
Summary.txt - Contains the ticket header and has been imported successfully.
Progress_#.txt - created every time a ticket gets updated; I get a new file each time.
Solution.txt
Importing the Issue.txt was easy since this was actually a CSV.
Now my problem is with Description and Progress files.
I need to update the existing rows with the data from these files, something along the lines of:
update table_ticket set table_ticket.description = Description.txt where ticket_number = directoryname
I'm using PostgreSQL; the COPY command only handles new data, and it would still fail due to the ',;/ special chars.
I wanted to do this using a bash script, but it seems it won't be possible:
for i in `find . -type d`
do
update table_ticket
set table_ticket.description = $i/Description.txt
where ticket_number = $i
done
Of course the above code would take into consideration connection to the database.
Does anyone have an idea of how I could achieve this using a shell script? Or would it be better to just write something in Java that reads the files and updates the records? I would like to avoid that approach, though.
Thanks
Alex
Thanks for the answer, but I came across this:
psql -U dbuser -h dbhost db
\set content `cat PATH/Description.txt`
update table_ticket set description = :'content' where ticketnr = TICKETNR;
Putting this into a simple script I created the following:
#!/bin/bash
# loop over the ticket directories (those whose path starts with ./CS)
for i in `find . -type d|grep ^./CS`
do
    # directory name without the leading "./"
    p=`echo $i|cut -b3-12 -`
    echo $p
    # substitute the directory name and ticket number into the SQL template
    sed s/PATH/${p}/g cmd.sql > cmd.tmp.sql
    ticketnr=`echo $p|cut -b5-10 -`
    sed -i s/TICKETNR/${ticketnr}/g cmd.tmp.sql
    cat cmd.tmp.sql
    psql -U supportAdmin -h localhost supportdb -f cmd.tmp.sql
done
The downside is that it always creates a new connection; later I'll change it to build a single file.
But it does exactly what I was looking for, putting the contents inside a single column.
psql can't read the file in for you directly unless you intend to store it as a large object, in which case you can use lo_import. See the psql command \lo_import.
Update: @AlexandreAlves points out that you can actually slurp file content in using
\set myvar `cat somefile`
then reference it as a psql variable with :'myvar'. Handy.
While it's possible to read the file in using the shell and feed it to psql it's going to be awkward at best as the shell offers neither a native PostgreSQL database driver with parameterised query support nor any text escaping functions. You'd have to roll your own string escaping.
Even then, you need to know that the text encoding of the input file is valid for your client_encoding, otherwise you'll insert garbage and/or get errors. It quickly lands up being easier to do it in a language with proper integration with PostgreSQL, like Python, Perl, Ruby or Java.
There is a way to do what you want in bash if you really must, though: use Pg's delimited dollar quoting with a randomized delimiter to help prevent SQL injection attacks. It's not perfect but it's pretty darn close. I'm writing an example now.
Given problematic file:
$ cat > difficult.txt <<__END__
Shell metacharacters like: $!(){}*?"'
SQL-significant characters like "'()
__END__
and sample table:
psql -c 'CREATE TABLE testfile(filecontent text not null);'
You can:
#!/bin/bash
filetoread=$1
sep=$(printf '%04x%04x\n' $RANDOM $RANDOM)
psql <<__END__
INSERT INTO testfile(filecontent) VALUES (
\$x${sep}\$$(cat ${filetoread})\$x${sep}\$
);
__END__
This could be a little hard to read and the random string generation is bash specific, though I'm sure there are probably portable approaches.
A random tag string consisting of alphanumeric characters (I used hex for convenience) is generated and stored in sep.
psql is then invoked with a here-document tag that isn't quoted. The lack of quoting is important, as <<'__END__' would tell bash not to interpret shell metacharacters within the string, whereas plain <<__END__ allows the shell to interpret them. We need the shell to interpret metacharacters, as we need to substitute sep into the here document and also need to use $(...) (equivalent to backticks) to insert the file text. The x before each substitution of sep is there because dollar-quote tags must be valid PostgreSQL identifiers, so they must start with a letter, not a number. There's an escaped dollar sign at the start and end of each tag because PostgreSQL dollar quotes are of the form $taghere$quoted text$taghere$.
So when the script is invoked as bash testscript.sh difficult.txt the here document lands up expanding into something like:
INSERT INTO testfile(filecontent) VALUES (
$x0a305c82$Shell metacharacters like: $!(){}*?"'
SQL-significant characters like "'()$x0a305c82$
);
where the tags vary each time, making SQL injection exploits that rely on prematurely ending the quoting difficult.
I still advise you to use a real scripting language, but this shows that it is indeed possible.
The best thing to do is to create a temporary table, COPY the data from the files in question into it, and then run your updates.
Your secondary option would be to create a function in a language like PL/PerlU and do this in a stored procedure, but you will lose a lot of the performance optimizations you get when you update from a temp table.
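A minimal sketch of the temp-table approach, assuming the descriptions have already been collected into a staging CSV with one ticket number and one description per row (staging_ticket and descriptions.csv are placeholder names; table_ticket, description and ticket_number are taken from the question):

CREATE TEMP TABLE staging_ticket (ticket_number text, description text);

-- \copy runs client-side in psql, so the file only needs to be readable by the client
\copy staging_ticket FROM 'descriptions.csv' WITH (FORMAT csv)

UPDATE table_ticket t
SET description = s.description
FROM staging_ticket s
WHERE t.ticket_number = s.ticket_number;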
I am generating CSV, and I want to store numbers without exponential format.
Please give me some suggestions.
I tried:
I used , perfectly,
I tried a single quote before the large number; I got the expected output in the CSV, but the single quote shows before the number, and if I click that number, the number displays perfectly.
I also tried a delimiter, that is, a ' quote before a trailing slash.
So far no luck.
You might try using something like the Math::BigInt module:
use Math::BigInt;

# Math::BigInt keeps full precision, so the value prints without an exponent
my $num = Math::BigInt->new(2);
$num = $num ** 128;
print "$num\n";
which will output:
340282366920938463463374607431768211456
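If the value starts out as an ordinary Perl float rather than a BigInt, a sprintf/printf format such as %.0f will also write it in plain decimal instead of exponential notation; note this is an alternative suggestion, not part of the original answer, and digits beyond double precision are not exact:

# plain decimal output instead of something like 1.23456789e+20
my $n = 1.23456789e+20;
printf("%.0f\n", $n);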