I have a table with 3 columns - type, name, and code.
The code column contains the procedure/function source code.
I have exported it to a CSV file using the Import/Export option in pgAdmin 4 v5, but the code column does not stay in a single cell in the CSV file; the data spreads across many rows and columns.
I have set the encoding to UTF8, which normally works fine when exporting other tables.
Other settings: Format: csv, Encoding: UTF8. I have not changed anything else.
Can someone help with how to export it properly?
An explanation of what you are seeing:
CREATE TABLE public.csv_test (
fld_1 character varying,
fld_2 character varying,
fld_3 character varying,
fld_4 character varying
);
insert into csv_test values ('1', E'line with line end. \n New line', 'test', 'dog');
insert into csv_test values ('2', E'line with line end. \n New line', 'test', 'dog');
insert into csv_test values ('3', E'line with line end. \n New line \n Another line', 'test2', 'cat');
insert into csv_test values ('4', E'line with line end. \n New line \n \t Another line', 'test3', 'cat');
select * from csv_test ;
fld_1 | fld_2 | fld_3 | fld_4
-------+-----------------------+-------+-------
1 | line with line end. +| test | dog
| New line | |
2 | line with line end. +| test | dog
| New line | |
3 | line with line end. +| test2 | cat
| New line +| |
| Another line | |
4 | line with line end. +| test3 | cat
| New line +| |
| Another line | |
\copy csv_test to csv_test.csv with (format 'csv');
\copy csv_test to csv_test.txt;
-- fld_2 has line ends and/or tabs, so in CSV the data will wrap inside the quotes.
cat csv_test.csv
1,"line with line end.
New line",test,dog
2,"line with line end.
New line",test,dog
3,"line with line end.
New line
Another line",test2,cat
4,"line with line end.
New line
Another line",test3,cat
-- In text format the line ends and tabs are written as escapes (\n, \t), so each record stays on a single line.
cat csv_test.txt
1 line with line end. \n New line test dog
2 line with line end. \n New line test dog
3 line with line end. \n New line \n Another line test2 cat
4 line with line end. \n New line \n \t Another line test3 cat
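So the export itself is correct: the wrapped lines are still one field, protected by the quotes, and any CSV-aware reader will return them as a single cell. A quick sanity check with Python's csv module (a minimal sketch, assuming the csv_test.csv produced above):
import csv
# Each row should come back with exactly 4 fields, and fld_2
# should keep its embedded newlines inside a single cell.
with open("csv_test.csv", newline="") as f:
    for row in csv.reader(f):
        print(len(row), repr(row[1]))
A plain text editor shows the field "spread" over several lines, but a CSV-aware tool reassembles each record correctly.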
I have the URL https://www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793\r in a dataset. I want to remove https:// at the start of the string and \r at the end of the string.
Creating a dataframe to replicate the issue:
c = spark.createDataFrame([('https://www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793\r',)], ['str'])
I tried the regexp_replace below, with a pipe (regex alternation) in the pattern, but it is not working as expected.
c.select(F.regexp_replace('str', 'https:// | \\r', '')).first()
Actual output:
www.youcuomizei.comEquaion-Kid-Backack-Peronalized301793
Expected output:
www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793
the "backslash"r (\r) is not showing in your original spark.createDataFrame object because you have to escape it. so your spark.createDataFrame should be. please note the double backslashes
c = spark.createDataFrame([("https://www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793\\r",)], ['str'])
which will give this output:
+------------------------------------------------------------------------------+
|str |
+------------------------------------------------------------------------------+
|https://www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793\r|
+------------------------------------------------------------------------------+
Your regex https://|[\\r] will not remove the \r. The regex should be:
c = (c
.withColumn("str", F.regexp_replace("str", "https://|[\\\\]r", ""))
)
which will give this output:
+--------------------------------------------------------------------+
|str |
+--------------------------------------------------------------------+
|www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793|
+--------------------------------------------------------------------+
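If you want to sanity-check the pattern outside Spark, Python's re module treats this pattern the same way (a sketch; it assumes the trailing \r really is a literal backslash followed by r, as in the corrected dataframe above, and adds ^/$ anchors so only the ends of the string are touched):
import re
s = "https://www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793\\r"
# ^https:// matches the scheme at the start of the string; \\r$ matches
# a literal backslash + r at the end. Anchoring leaves the middle untouched.
print(re.sub(r"^https://|\\r$", "", s))
# www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793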
I have data with very odd delimiters:
1,|ABC1|,|BUD|,|Fed Budget & Appropriations|,|t1|
2,|ABC2|,|LBR|,|Labor, Antitrust & Workplace|,|t2|
3,|ABC3|,|UNM|,|Unemployment|,|t1|
So the delimiter is a comma, and each variable but the first one (the identifier) is enclosed between pipes. The problem is that the fourth variable also contains commas, so I can't simply use commas as delimiters and delete the pipes. I have found a way to clean the data with some find-and-replace operations in the terminal, but I would like to do this in Stata. Does anyone have an idea how?
I put your data example into a text file and found that the delimiters were detected quite well automatically. Then I dropped any variable that was all commas or all missing, using findname from the Stata Journal.
. import delimited "troublesome.txt"
(9 vars, 3 obs)
. list
+-------------------------------------------------------------------------+
| v1 v2 v3 v4 v5 v6 v7 v8 v9 |
|-------------------------------------------------------------------------|
1. | 1, ABC1 , BUD , Fed Budget & Appropriations , t1 . |
2. | 2, ABC2 , LBR , Labor, Antitrust & Workplace , t2 . |
3. | 3, ABC3 , UNM , Unemployment , t1 . |
+-------------------------------------------------------------------------+
. findname, all(# == ",")
v3 v5 v7
. drop `r(varlist)'
. findname, all(missing(#))
v9
. drop `r(varlist)'
. destring v1, ignore(",") replace
v1: character , removed; replaced as byte
. list
+-----------------------------------------------------+
| v1 v2 v4 v6 v8 |
|-----------------------------------------------------|
1. | 1 ABC1 BUD Fed Budget & Appropriations t1 |
2. | 2 ABC2 LBR Labor, Antitrust & Workplace t2 |
3. | 3 ABC3 UNM Unemployment t1 |
+-----------------------------------------------------+
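For what it's worth, the pipes here are effectively a nonstandard quote character, so if you ever do want the preprocessing route mentioned in the question, a short Python sketch (the file name troublesome.txt is assumed) can rewrite the file as ordinary CSV by telling the csv module that | is the quote character:
import csv
# Read with | as the quote character; write back with standard double quotes,
# so the comma inside the fourth field stays protected.
with open("troublesome.txt", newline="") as src, \
     open("standard.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src, quotechar="|"):
        writer.writerow(row)
After that, import delimited sees five clean variables and there is nothing left to drop.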
I am trying to split a string column of a dataframe in spark based on a delimiter ":|:|:"
Input:
TEST:|:|:51:|:|:PHT054008056
Test code:
dataframe1
.withColumn("splitColumn", split(col("testcolumn"), ":|:|:"))
Result:
+------------------------------+
|splitColumn |
+------------------------------+
|[TEST, |, |, 51, |, |, P] |
+------------------------------+
Test code:
dataframe1
.withColumn("part1", split(col("testcolumn"), ":|:|:").getItem(0))
.withColumn("part2", split(col("testcolumn"), ":|:|:").getItem(3))
.withColumn("part3", split(col("testcolumn"), ":|:|:").getItem(6))
part1 and part2 work correctly.
part3 has only 2 characters; the rest of the string is truncated.
part3:
P
I want to get the entire part3 string.
Any help is appreciated.
You're almost there – just need to escape | within your delimiter, as follows:
val df = Seq(
(1, "TEST:|:|:51:|:|:PHT054008056"),
(2, "TEST:|:|:52:|:|:PHT053007057")
).toDF("id", "testcolumn")
df.withColumn("part3", split($"testcolumn", ":\\|:\\|:").getItem(2)).show
// +---+--------------------+------------+
// | id| testcolumn| part3|
// +---+--------------------+------------+
// | 1|TEST:|:|:51:|:|:P...|PHT054008056|
// | 2|TEST:|:|:52:|:|:P...|PHT053007057|
// +---+--------------------+------------+
[UPDATE]
You could also use triple quotes for the delimiter, in which case you still have to escape | to indicate it's a literal pipe (not regex alternation):
df.withColumn("part3", split($"testcolumn", """:\|:\|:""").getItem(2)).show
Note that with triple quotes, you need only a single escape character \, whereas without the triple quotes the escape character itself needs to be escaped (hence \\).
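For reference, the same fix in PySpark looks like this (a sketch, assuming the usual functions import and an active spark session):
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(1, "TEST:|:|:51:|:|:PHT054008056"),
     (2, "TEST:|:|:52:|:|:PHT053007057")],
    ["id", "testcolumn"])
# Escape each | so the delimiter is matched literally, not as alternation.
df.withColumn("part3",
              F.split(F.col("testcolumn"), ":\\|:\\|:").getItem(2)) \
  .show(truncate=False)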
My table structure is
company=# \d address
Table "public.address"
Column | Type | Modifiers
----------+-----------------------+-----------
name | character varying(80) |
age | integer |
dob | date |
village | character varying(8) |
locality | character varying(80) |
district | character varying(80) |
state | character varying(80) |
pin | integer |
and I have the following data in a flat file (*.txt):
insert into address(name,age,dob,village,locality,district,state,pin)
values('David',43,'1972-10-23','Elchuru','Addanki','Prakasam','AP',544421);
insert into address(name,age,dob,village,locality,district,state,pin)
values('George',53,'1962-10-23','London','London','LN','LN',544421);
insert into address(name,age,dob,village,locality,district,state,pin)
values('David',28,'1982-10-23','Ongole','Ongole','Prakasam','AP',520421);
Now I am trying to load it into my table 'address' using the following command in the psql shell:
copy address from 'C:/P Files/address_data.txt';
Error is:
company=# copy address from 'C:/P Files/address_data.txt';
ERROR: value too long for type character varying(80)
CONTEXT: COPY address, line 1, column name: "insert into address(name,age,dob,village,locality,district,state,pin) values('David',43,'1972-10-23'..."
Please suggest the modifications needed to make this work.
You don't have a data file. You have a file with a set of commands.
You can execute those INSERT statements with psql instead (e.g. psql -f address_data.txt).
A data file would look more like this:
David,43,1972-10-23,Elchuru,Addanki,Prakasam,AP,544421
George,53,1962-10-23,London,London,LN,LN,544421
David,28,1982-10-23,Ongole,Ongole,Prakasam,AP,520421
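Once the file is in that shape, COPY (or psql's \copy) will load it directly. If you would rather drive the load from a script, one possible sketch uses psycopg2's copy_expert (the connection settings and file name below are placeholders):
import psycopg2
conn = psycopg2.connect(dbname="company")  # placeholder connection settings
with conn, conn.cursor() as cur, open("address_data.csv") as f:
    # COPY expects plain data rows, not SQL statements.
    cur.copy_expert("COPY address FROM STDIN WITH (FORMAT csv)", f)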
I have an odd dataset that I need to import into SAS, splitting the records into two tables depending on formatting, and dropping some records altogether. The data is structured as follows:
c Comment line 1
c Comment line 2
t lines init
a 'mme006' M 8 99 15 '111 ME - RANDOLPH ST'
path=no
dwt=0.01 42427 ttf=1 us1=3 us2=0
dwt=#0 42350 ttf=1 us1=1.8 us2=0 lay=3
dwt=>0 42352 ttf=1 us1=0.5 us2=18.13
42349 lay=3
a 'mme007' M 8 99 15 '111 ME - RANDOLPH ST'
path=no
dwt=+0 42367 ttf=1 us1=0.6 us2=0
dwt=0.01 42368 ttf=1 us1=0.6 us2=35.63 lay=3
dwt=#0 42369 ttf=1 us1=0.3 us2=0
42381 lay=3
Only the lines beginning with a, dwt or an integer need to be kept.
For the lines beginning with a, the desired output is a table like this, called "lines", which contains the first two non-a values in the row:
name | type
--------+------
mme006 | M
mme007 | M
For the dwt/integer rows, the table "itins" would look like so:
anode | dwt | ttf | us1 | us2 | lay
------+------+-----+-----+-------+-----
42427 | 0.01 | 1 | 3.0 | 0.00 |
42350 | #0 | 1 | 1.8 | 0.00 | 3
42352 | >0 | 1 | 0.5 | 18.13 |
42349 | | | | | 3 <-- line starting with integer
42367 | +0 | 1 | 0.6 | 0.00 |
42368 | 0.01 | 1 | 0.6 | 35.63 | 3
42369 | #0 | 1 | 0.3 | 0.00 |
42381 | | | | | 3 <-- line starting with integer
The code I have so far is almost there, but not quite:
data lines itins;
infile in1 missover;
input #1 first $1. #;
if first in ('c','t') then delete;
else if first='a' then do;
input name $ type $;
output lines; end;
else do;
input #1 path=$ dwt=$ anode ttf= us1= us2= us3= lay=;
if path='no' then delete;
output itins; end;
run;
The problems:
The "lines" table is correct, except I can't get rid of the quotes around the "name" values (e.g. 'mme006')
In the "itins" table, "ttf", "us1", and "us2" are being populated correctly. However, "anode" and "lay" are always null, and "dwt" has values like #0 4236 and 0.01 42, always 8 characters long, borrowing part of what should be in "anode".
What am I doing wrong?
DEQUOTE() will remove matched quotation marks.
Your problem with dwt is that you need to tell SAS what informat to use; if dwt is four characters long, use :$4. instead of just $.
However, anode is a problem. The solution I came up with is:
data lines itins;
infile in1 missover;
input #1 first $1. #;
if first in ('c','t') then delete;
else if first='a' then do;
input name $ type $;
output lines; end;
else do;
input #1 path= $ #;
if path='no' then delete;
else do;
if substr(_infile_,5,1)='d' then do;
input dwt= :$12. ttf= us1= us2= us3= lay=;
anode=input(scan(dwt,2,' '),best.);
dwt=scan(dwt,1,' ');
output itins;
end;
else do;
input #5 anode 5. lay=;
output itins;
end;
end;
end;
run;
Basically, check for path first; then, if it's not a path row, check for the 'd' in dwt. If that's present, read the line in that form, pulling anode into dwt and then splitting it back out afterwards. If it's not present, just read in anode and lay.
If dwt can have widths other than 2-4 such that it might need to be shorter, then this probably won't work, and you'll have to explicitly figure out the position of anode to read it in properly.