CASE where a lot of text needs trimming - tsql

I have an output in my queries that gives me:
xxxxxxx xxxxxxxxx : 123456 (xx) - xxxxxxx...
or
xxxxxxx xxxxxxxxx : 12345678 (xx) - xxxxxxx...
basically text before either a 6 or 8 digit number then text after.
Ideally I'd like to be able to CASE this column so I'd have an output where it's a 6 digit number = London and the output when it's an 8 digit number = Paris.
But I am very stuck on how to get the CASE statement to achieve this - essentially trim out a lot of text, work out if the number is either 6 or 8 digits long, then tell me if it's London or Paris. I'm not sure if it's possible.
Is this sort of CASE statement achievable? Advise / pointers would be very gratefully received. Thanks very much.

One way to do it is using a common table expression with patindex.
First, create and populate sample table (Please save us this step in your future questions)
DECLARE #T As Table
(
LongText varchar(4000)
)
INSERT INTO #T (LongText) VALUES
('.njauaerigha n uaer gauer 345 gnaehn 123456 (43) smgmshmsrtmh s;s nt;srtn ;nbtugarg '),
('asdfasdfasdf 12345678 (65) asdfag gr 34 6sd 64 fasd fasdfas d fasdfasdf asdf'),
('zxcvzxcvzx34cv zxcvzxcv zxcv zxcv zxcv zcxvz xcv z3 45 xcvzxcz dfg dv df df zfd zdf b 654321 (77)'),
('87654321 (99) n;arng an ; ualerg trhrt srth str sth strh ssth'),
('snhs tgnn ang nu g;arug aegaerlhae s ;5 afnauierhga.ngae489tj 8q3y .sn.5yn b.s n .5hy 5');
Then, the common table expression to get the start index of the 8/6 digit number:
WITH CTE AS
(
SELECT LongText,
PATINDEX('%[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9] ([0-9][0-9])%', LongText) As Long,
PATINDEX('%[0-9][0-9][0-9][0-9][0-9][0-9] ([0-9][0-9])%', LongText) As Short
FROM #T
)
Then, select with a couple of case expressions to extract the number and get the city:
SELECT CASE
WHEN Long > 0 THEN SUBSTRING(LongText, Long, 13)
WHEN Short > 0 THEN SUBSTRING(LongText, Short, 11)
END As Number,
CASE
WHEN Long > 0 THEN 'Paris'
WHEN Short > 0 THEN 'London'
END As City,
LongText
FROM CTE
Results:
Number City LongText
123456 (43) London .njauaerigha n uaer gauer 345 gnaehn 123456 (43) smgmshmsrtmh s;s nt;srtn ;nbtugarg
12345678 (65) Paris asdfasdfasdf 12345678 (65) asdfag gr 34 6sd 64 fasd fasdfas d fasdfasdf asdf
654321 (77) London zxcvzxcvzx34cv zxcvzxcv zxcv zxcv zxcv zcxvz xcv z3 45 xcvzxcz dfg dv df df zfd zdf b 654321 (77)
87654321 (99) Paris 87654321 (99) n;arng an ; ualerg trhrt srth str sth strh ssth
NULL NULL snhs tgnn ang nu g;arug aegaerlhae s ;5 afnauierhga.ngae489tj 8q3y .sn.5yn b.s n .5hy 5

Related

Postgresql: Concatenating regexp_split_to_table into a single string

In Postgresql, I need to take an ASCII string like CNR39XL and turn it into a number like 21317392311 (ASCII values - 65, but digits left as is). This is to take some logic in our software and push it down into the databse level. I’m partway there with this:
ourdb=> SELECT CASE
ourdb-> WHEN ASCII(c) < 65 THEN c::integer
ourdb-> ELSE ASCII(c)-65
ourdb-> END
ourdb-> FROM regexp_split_to_table('CNR39XL','') s(c);
case
------
2
13
17
3
9
23
11
(7 rows)
However, I can't figure out how to take that final set of numbers and convert it into the string. What am I looking for?
Use string_agg
SELECT CAST(
string_agg(
CAST (CASE ... END AS text),
''
)
AS numeric)
FROM ...

SQL Server 2012 finding text before and after delimiter and sometimes without delimiter

I have been having fun with an issue where I need to break apart a string in SQL Server 2012 and test for values it may or may not contain. The values, when present, will be separated by up to two different ; symbols.
When there is nothing, it will be blank.
When there is a single value, it will show up without the delimiter.
When there are two or more, up to 3, they will be separated by the delimiter.
As I said, if there is nothing in the record, it will be blank. Below are some example of how the data may come across:
' ',
'1',
'24',
'15;1;24',
'10;1;22',
'5;1;7',
'12;1',
'10;12',
'1;5',
'1;1;1',
'15;20;22'
I have searched the forums and found many clues, but I have not been able to come up with a total solution given all potential data values. Essentially, I would like to break it into 3 separate values.
text before the first delimiter or in the absence of the delimiter, just the text.
Text after the first delimiter and before the second in situation where there are two delimiters.
The following has worked consistently:
substring(SUBSTRING(Food_Desc, charindex(';', Food_Desc) + 1, 4), 0,
charindex(';', SUBSTRING(Food_Desc, charindex(';', Food_Desc) + 1, 4))) as [Middle]
Text after the second delimiter in the even there are two delimiters and there is a third value
The main challenge is the fact that the delimiter, when present, moves depending on the value in the table. values 1-9 make it show up as the second character in the string, values 10-24 make it show up as the 3rd, etc.
Any help would be greatly appreciated.
This is simple if you have a well written t-sql splitter function. For this solution I'm using Jeff Moden's delimitedsplit8k.
sample data and solution
DECLARE #table table (someid int identity, sometext varchar(100));
INSERT #table VALUES (' '),('1'),('24'),('15;1;24'),('10;1;22'),
('5;1;7'),('12;1'),('10;12'),('1;5'),('1;1;1'),('15;20;22');
SELECT
someid,
sometext,
ItemNumber,
Item
FROM #table
CROSS APPLY dbo.DelimitedSplit8K_LEAD(sometext, ';');
results
someid sometext ItemNumber Item
----------- ----------------- ----------- --------
1 1
2 1 1 1
3 24 1 24
4 15;1;24 1 15
4 15;1;24 2 1
4 15;1;24 3 24
5 10;1;22 1 10
5 10;1;22 2 1
5 10;1;22 3 22
6 5;1;7 1 5
6 5;1;7 2 1
6 5;1;7 3 7
7 12;1 1 12
7 12;1 2 1
8 10;12 1 10
8 10;12 2 12
9 1;5 1 1
9 1;5 2 5
10 1;1;1 1 1
10 1;1;1 2 1
10 1;1;1 3 1
11 15;20;22 1 15
11 15;20;22 2 20
11 15;20;22 3 22
Below is a modified version of a similar question How do I split a string so I can access item x?. Changing the text value for #sample to each of your possibilities listed seemed to work for me.
DECLARE #sample VARCHAR(200) = '15;20;22';
DECLARE #individual VARCHAR(20) = NULL;
WHILE LEN(#sample) > 0
BEGIN
IF PATINDEX('%;%', #sample) > 0
BEGIN
SET #individual = SUBSTRING(#sample, 0, PATINDEX('%;%', #sample));
SELECT #individual;
SET #sample = SUBSTRING(#sample, LEN(#individual + ';') + 1, LEN(#sample));
END;
ELSE
BEGIN
SET #individual = #sample;
SET #sample = NULL;
SELECT #individual;
END;
END;

While copying data from S3 to redshift, I am facing error which is :

Missing newline: Unexpected character 0x22 found at location 0
Table Create statement is
CREATE TABLE venue (
City varchar(45) ,
Country varchar(2) ,
Description varchar(82) ,
lat_lon varchar(30) ,
Region varchar(30) ,
State varchar(15) ,
Venue_Config_ID int ,
zip varchar(8) ,
CT_ID int ,
CN_ID int ,
DS_ID int ,
RG_ID int ,
ZP_ID int ,
ST_ID int
)
and sample line from CSV file is
" D e n v e r"," U S"," E l l i e C a u l k i n s O p e r a H o u s e"," 3 9 . 7 4 3 6 4 7 9 , - 1 0 4 . 9 9 8 1 4"," D e n v e r"," C O","1230057"," 8 0 2 0 4","11","1","8771","11","2673","11"
Any help would be appreciated
0x22 is a quotation mark ("). This could be related to the fact that you are loading int fields with text that is quoted.
Try using the REMOVEQUOTES option on your COPY command to remove the quotes.
I had the same problem, what I found was that my csv files were unicode, and Redshift COPY expects UTF-8:
By default, the COPY command expects the source data to be in character-delimited UTF-8 text files. The default delimiter is a pipe character ( | ). If the source data is in another format, use the following parameters to specify the data format.

How to move pairs of columns to new table, saving FK to the old table and creating a new column with a value as a name of column?

I downloaded database with translations of countries and cities to 70 languages (some of translations are ''), but translations and technical information (populations, flags, territory, phones, etc) about cities\countries saved in the same table.
I mean every translation has its own columns (tranlation itself + description on the language of translation) next to the other info which is not related to translation. Totally about 190 colums, including 70*2 (translation + description).
I don't think this is proper way and I want to move all translations to seperated table keeping FK to main\technical-info table.
So, now I have a table "cities" with the structure like below:
id region_id countries_id phone population lang_1 description_1 lang_2 description_2 lang_3 description_3 .... lang_70 description_70
1 1 1 +7 123 Москва SomeDesc Moscow SomeDesc2 Moskwa SomeText3 Translation70 SomeDesc70
2 1 1 +7 123 Кубинка SomeDesc Kubinka SomeDesc2 Kubinka '' Translation70 SomeDesc70
with 2.5M rows\cities.
I want to move all "lang_(1-70)" and their descriptions to new table "cities_translated" which should look like that:
id cities_id name description lang
1 1 Москва SomeDesc lang_1
2 1 Moscow SomeDesc2 lang_2
3 1 Moskwa SomeText3 lang_3
...
70 1 Translation70 SomeDesc70 lang_70
71 2 Кубинка SomeDesc lang_1
72 2 Kubinka SomeDesc2 lang_2
73 2 Kubinka SomeDesc3 lang_3
...
140 2 Translation70 SomeDesc70 lang_70
Could anyone help me please with proper query to do this transfer?
P.S. I have already a table "languages" and as the next step I will replace all values like 'lang_1', 'lang_2' and so on to proper FK.
Hoped to get raw sql solution in order to improve my sql knowledge, but due to no anwers, i decided to use Python.
initial_table = 'countries.city'
init_table_columns = ['lang_1', 'description_1', 'lang_2', 'description_2', 'lang_3', 'description_3', 'lang_4', 'description_4', 'lang_5', 'description_5', 'lang_6', 'description_6', 'lang_7', 'description_7', 'lang_8', 'description_8', 'lang_9', 'description_9', 'lang_10', 'description_10', 'lang_11', 'description_11', 'lang_12', 'description_12', 'lang_13', 'description_13', 'lang_14', 'description_14', 'lang_15', 'description_15', 'lang_16', 'description_16', 'lang_17', 'description_17', 'lang_18', 'description_18', 'lang_19', 'description_19', 'lang_20', 'description_20', 'lang_21', 'description_21', 'lang_22', 'description_22', 'lang_23', 'description_23', 'lang_24', 'description_24', 'lang_25', 'description_25', 'lang_26', 'description_26', 'lang_27', 'description_27', 'lang_28', 'description_28', 'lang_29', 'description_29', 'lang_30', 'description_30', 'lang_31', 'description_31', 'lang_32', 'description_32', 'lang_33', 'description_33', 'lang_34', 'description_34', 'lang_35', 'description_35', 'lang_36', 'description_36', 'lang_37', 'description_37', 'lang_38', 'description_38', 'lang_39', 'description_39', 'lang_40', 'description_40', 'lang_41', 'description_41', 'lang_42', 'description_42', 'lang_43', 'description_43', 'lang_44', 'description_44', 'lang_45', 'description_45', 'lang_46', 'description_46', 'lang_47', 'description_47', 'lang_48', 'description_48', 'lang_49', 'description_49', 'lang_50', 'description_50', 'lang_51', 'description_51', 'lang_52', 'description_52', 'lang_53', 'description_53', 'lang_54', 'description_54', 'lang_55', 'description_55', 'lang_56', 'description_56', 'lang_57', 'description_57', 'lang_58', 'description_58', 'lang_59', 'description_59', 'lang_60', 'description_60', 'lang_61', 'description_61', 'lang_62', 'description_62', 'lang_63', 'description_63', 'lang_64', 'description_64', 'lang_65', 'description_65', 'lang_66', 'description_66', 'lang_67', 'description_67', 'lang_68', 'description_68', 'lang_69', 'description_69', 'lang_70', 'description_70']
table_translation = 'countries.city_translated'
import psycopg2
import re
conn = psycopg2.connect(database="countries", host='localhost', user="postgres", password="Password")
cur = conn.cursor()
new_cursor = conn.cursor()
cur.execute("""SELECT id FROM %s """ % initial_table)
rows = cur.fetchall()
print("%i rows retrieved" % cur.rowcount)
new_cursor.execute("""BEGIN""")
for row in rows:
print('row:', row)
get_id = row[0]
cur.execute("""SELECT %s FROM %s WHERE id=%s """ % (",".join(init_table_columns), initial_table, get_id))
row_w_info = cur.fetchall()
for i in range(140):
if i%2==0:
name = row_w_info[0][i]
description = row_w_info[0][i+1]
lang_text = init_table_columns[i]
lang_id = int(re.findall(r'\d+', lang_text)[0])
# There are 70 translations, but there is no info what languages are 68, 69, 70
if lang_id >= 68:
lang_id = None
new_cursor.execute("INSERT INTO countries.city_translated (city_id, name, description, lang_id) VALUES (%s, %s, %s, %s)", (get_id, name, description, lang_id))
new_cursor.execute("""COMMIT""")
cur.close()
new_cursor.close()
conn.close()

Extracting values from non-standard markup strings in PostgreSQL

Unfortunately, I have a table like the following:
DROP TABLE IF EXISTS my_list;
CREATE TABLE my_list (index int PRIMARY KEY, mystring text, status text);
INSERT INTO my_list
(index, mystring, status) VALUES
(12, '', 'D'),
(14, '[id] 5', 'A'),
(15, '[id] 12[num] 03952145815', 'C'),
(16, '[id] 314[num] 03952145815[name] Sweet', 'E'),
(19, '[id] 01211[num] 03952145815[name] Home[oth] Alabama', 'B');
Is there any trick to get out number of [id] as integer from the mystring text shown above? As though I ran the following query:
SELECT index, extract_id_function(mystring), status FROM my_list;
and got results like:
12 0 D
14 5 A
15 12 C
16 314 E
19 1211 B
Preferably with only simple string functions and if not regular expression will be fine.
If I understand correctly, you have a rather unconventional markup format where [id] is followed by a space, then a series of digits that represents a numeric identifier. There is no closing tag, the next non-numeric field ends the ID.
If so, you're going to be able to do this with non-regexp string ops, but only quite badly. What you'd really need is the SQL equivalent of strtol, which consumes input up to the first non-digit and just returns that. A cast to integer will not do that, it'll report an error if it sees non-numeric garbage after the number. (As it happens I just wrote a C extension that exposes strtol for decoding hex values, but I'm guessing you don't want to use C extensions if you don't even want regex...)
It can be done with string ops if you make the simplifying assumption that an [id] nnnn tag always ends with either end of string or another tag, so it's always [ at the end of the number. We also assume that you're only interested in the first [id] if multiple appear in a string. That way you can write something like the following horrible monstrosity:
select
"index",
case
when next_tag_idx > 0 then substring(cut_id from 0 for next_tag_idx)
else cut_id
end AS "my_id",
"status"
from (
select
position('[' in cut_id) AS next_tag_idx,
*
from (
select
case
when id_offset = 0 then null
else substring(mystring from id_offset + 4)
end AS cut_id,
*
from (
select
position('[id] ' in mystring) AS id_offset,
*
from my_list
) x
) y
) z;
(If anybody ever actually uses that query for anything, kittens will fall from the sky and splat upon the pavement, wailing in horror all the way down).
Or you can be sensible and just use a regular expression for this kind of string processing, in which case your query (assuming you only want the first [id]) is:
regress=> SELECT
"index",
coalesce((SELECT (regexp_matches(mystring, '\[id\]\s?(\d+)'))[1])::integer, 0) AS my_id,
status
FROM my_list;
index | my_id | status
-------+----------------+--------
12 | 0 | D
14 | 5 | A
15 | 12 | C
16 | 314 | E
19 | 01211 | B
(5 rows)
Update: If you're having issues with unicode handling in regex, upgrade to Pg 9.2. See https://stackoverflow.com/a/14293924/398670