I am removing some words from a column. At the end of the day some rows will be empty because their entire string has been removed; what remains might be a space, other whitespace, or nothing. How can I remove these rows?
I tried this, but for some reason it does not work for all kinds of rows:
from pyspark.sql.functions import trim, regexp_replace

df = df.withColumn('col1', trim(regexp_replace('col1', '\n', '')))
df = df.filter(df.col1 != '')
The filter you've applied will work for empty strings, but not for values that still contain whitespace.
Try trim(<column>) != '' instead.
Example
from pyspark.sql import functions as func

spark.sparkContext.parallelize([('',), (' ',), (' ',)]).toDF(['foo']). \
filter(func.col('foo') != ''). \
count()
# 2 -- the whitespace-only rows survive

spark.sparkContext.parallelize([('',), (' ',), (' ',)]).toDF(['foo']). \
filter(func.trim(func.col('foo')) != ''). \
count()
# 0 -- all rows are filtered out
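Applied to the snippet from the question, the fix might look like the sketch below (column name col1 as in the question). Note that the same comparison also drops NULL values, since NULL != '' does not evaluate to true.

from pyspark.sql.functions import col, trim, regexp_replace

df = df.withColumn('col1', trim(regexp_replace('col1', '\n', '')))
# trim again inside the filter so whitespace-only values compare equal to ''
df = df.filter(trim(col('col1')) != '')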
My PySpark DataFrame has a column with values of the form 0000-00-00-00-00-00-000_000.xxxx, where 0 is a digit and x is a letter. The value represents an observation timestamp with some other values mixed in.
In my notebook, I have a cell that attempts to split the column containing the timestamp. For the most part, it works. I get most of the work done with the following:
from pyspark.sql.functions import split

splitDF = ( df
    .withColumn("fn_year", split(df["fn"], "-").getItem(0))
    .withColumn("fn_month", split(df["fn"], "-").getItem(1))
    .withColumn("fn_day", split(df["fn"], "-").getItem(2))
    .withColumn("fn_hour", split(df["fn"], "-").getItem(3))
    .withColumn("fn_min", split(df["fn"], "-").getItem(4))
    .withColumn("fn_sec", split(df["fn"], "-").getItem(5))
    .withColumn("fn_milli", split(df["fn"], "-").getItem(6))
)
I need to extract two values from the string: the 000 preceding the underscore and the 000 following the underscore. I would normally (my usual language / environment is C# / .NET 7, web API stuff) just split the string multiple times using the two delimiters ('_' and '.') and grab the necessary components, but I can't get that to work in this case. When I try to pass one split into another split, I get ["", "", "", "", "", "", "", "", ""] as the result (.getItem(x) omitted).
Here's an example of what I thought might work to split on the underscore and then the period:
splitDF = df.withColumn("fn_qc", split(split(df["fn"], "_").getItem(1), ".").getItem(0))
Basically, we split the string on the dash; that returns an array which is then reused across the columns. In the last statement, we split again on the underscore. For the final value you could use a substring, split again on the period, or just strip xxxx if it is a static suffix. (The nested split in the question returned empty strings because the second argument to split is a regular expression, and an unescaped period matches every character; see the note after the example.)
Hope this helps.
from pyspark.sql.functions import split, col, substring
date_list = [["2023-01-02-03-04-05-666_777.xxxx"], ["2023-12-11-10-09-08-444_333.xxxx"]]
cols = ["fn"]
df = spark.createDataFrame(date_list, cols)
splitDF = df.withColumn("split_on_dash", split(col("fn"), "-")) \
.withColumn("fn_year", col("split_on_dash")[0]) \
.withColumn("fn_month", col("split_on_dash")[1]) \
.withColumn("fn_day", col("split_on_dash")[2]) \
.withColumn("fn_hour", col("split_on_dash")[3]) \
.withColumn("fn_min", col("split_on_dash")[4]) \
.withColumn("fn_sec", col("split_on_dash")[5]) \
.withColumn("fn_milli", split(col("split_on_dash")[6], "_")[0]) \
.withColumn("fn_after_underscore", substring(split(col("split_on_dash")[6], "_")[1], 0, 3))
display(splitDF)
You can select only the required columns later or drop the unnecessary ones...
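As noted above, the original nested-split attempt also works once the period is escaped, because split interprets its second argument as a regular expression. A minimal sketch against the same sample DataFrame:

from pyspark.sql.functions import col, split

# "\\." matches a literal period instead of any character
splitDF = df.withColumn("fn_qc", split(split(col("fn"), "_").getItem(1), "\\.").getItem(0))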
I have a column as below:
mystring
AC1853551,AC1854125,AC1855220,AC188115,AC1884120,AC1884390,AC1885102
I need to transform it to get this output:
mystring
('AC1853551','AC1854125','AC1855220','AC188115','AC1884120','AC1884390','AC1885102')
Here is the query that I tried:
select CONCAT('( , CONCAT (mystring, ')')) from mytablename
I'm getting an error when it comes to inserting a single quote '.
Then I thought about replacing the comma with ','.
How can I get the desired output?
I'm using Postgres 10.
A literal single quote is coded as a doubled quote:
select '(''' || replace(mycolumn, ',', ''',''') || ''')'
from mytable
See live demo.
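Using the table and column names from the question, the query and its result would look like this (a sketch against the sample value above):

select '(''' || replace(mystring, ',', ''',''') || ''')' as mystring
from mytablename;
-- ('AC1853551','AC1854125','AC1855220','AC188115','AC1884120','AC1884390','AC1885102')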
I have strings like this
#
word_1
word_2
#
word_3
#
#
where # represents empty lines.
I'd like to remove those empty lines, to get:
word_1
word_2
word_3
I've tried replacing CHR(10) and CHR(13) with '' but then I get
word_1word_2word_3
I've seen I can remove the first empty line using LTRIM, but how can I get rid of all of them?
You need to remove every newline character that is followed by another newline, plus a single newline at the start and at the end of the string. All these replacements can be done with a single expression.
Starting from Db2 v11.1:
select regexp_replace (s, '\r\n(?=\r\n)|^\r\n|\r\n$', '')
from (values x'0d0a' || 'abc' || x'0d0a0d0a'|| 'def' || x'0d0a') t (s)
Note that your newlines may be encoded as x'0a' instead of x'0d0a'. In that case, remove all the \r characters from the expression above.
dbfiddle link.
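For LF-only data, the adjusted expression might look like this (simply the suggestion above applied, with the sample data rebuilt around x'0a'):

select regexp_replace (s, '\n(?=\n)|^\n|\n$', '')
from (values x'0a' || 'abc' || x'0a0a' || 'def' || x'0a') t (s)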
Starting from Db2 v9.7, you can push the work into XQuery: the inner replace strips a single leading and trailing newline, and the outer replace collapses each run of newlines to a single one.
select xmlcast (xmlquery ('replace (replace ($d, "^\r\n|\r\n$", ""), "(\r\n){2,}", "$1")' passing s as "d") as varchar (100))
from (values x'0d0a' || 'abc' || x'0d0a0d0a'|| 'def' || x'0d0a') t (s)
dbfiddle link.
How can I return the last n words using Postgres?
I have tried using the LEFT method:
SELECT DISTINCT LEFT(name, -4) FROM my_table;
but that deals with the last 4 characters, and I want the last 3 words.
demo:db<>fiddle
You can do this using the SUBSTRING() function and regular expressions:
SELECT
    SUBSTRING(name FROM '((\S+\s+){0,2}\S+$)')
FROM my_table
This has been explained here: How can I match the last two words in a sentence in PostgreSQL?
\S+ is a string of non-whitespace characters
\s+ is a string of whitespace characters (e.g. one space)
(\S+\s+){0,2} zero to two words, each followed by whitespace
\S+$ one word at the end of the text
-> matches 3 words (or fewer if there are no more).
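For example, as a quick sanity check (any sample string works):

SELECT SUBSTRING('the quick brown fox jumps' FROM '((\S+\s+){0,2}\S+$)');
-- brown fox jumps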
One way is to use regexp_split_to_array() to split the string into its words and then put a string back together from the last 3 words in that array. The coalesce calls guard against strings with fewer than three words, where the lower array indexes yield NULL.
SELECT coalesce(w.words[array_length(w.words, 1) - 2] || ' ', '')
|| coalesce(w.words[array_length(w.words, 1) - 1] || ' ', '')
|| coalesce(w.words[array_length(w.words, 1)], '')
FROM mytable t
CROSS JOIN LATERAL (SELECT regexp_split_to_array(t."name", ' ') words) w;
db<>fiddle
RIGHT() should do
SELECT RIGHT('MYCOLUMN', 4); -- returns LUMN
UPD
You can convert the string to an array and then back to a string, slicing off the trailing elements. The demo below takes the last four words; a sketch adjusted to the question's three words follows it.
SELECT array_to_string(sentence[(array_length(sentence,1)-3):(array_length(sentence,1))],' ','*')
FROM
(
SELECT regexp_split_to_array('this is the one of the way to get the last four words of the string', E'\\s+') AS sentence
) foo;
DEMO HERE
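Adjusted to the last three words, the slice's lower bound moves up by one (a sketch; the two-argument array_to_string is enough when no NULLs are expected):

SELECT array_to_string(sentence[(array_length(sentence,1)-2):array_length(sentence,1)], ' ')
FROM
(
SELECT regexp_split_to_array('this is the one of the way to get the last four words of the string', E'\\s+') AS sentence
) foo;
-- of the string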
I have a table mytable that has a column ngram which is a VARCHAR2. I want to SELECT only those rows where ngram does not contain any whitespaces (tabs, spaces, EOLs etc). What should I replace <COND> below with?
SELECT ngram FROM mytable WHERE <COND>;
Thanks!
You could use regexp_instr (or regexp_like, or other regexp functions); for example:
where regexp_instr(ngram, '[ '|| CHR(10) || CHR(13) || CHR(9) ||']') = 0
the space character itself is handled by the literal space in '[ '
chr(10) = line feed
chr(13) = carriage return
chr(9) = tab
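If you'd rather not enumerate CHR() codes, the POSIX character class [[:space:]] is an equivalent alternative (not part of the answer above; it matches space, tab, newline, and carriage return, and additionally vertical tab and form feed):

SELECT ngram FROM mytable
WHERE NOT regexp_like(ngram, '[[:space:]]');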
You can use the CHR and INSTR functions with the ASCII codes of the characters you want to filter. For example, your WHERE clause can look like this for a single special character:
INSTR(ngram, CHR(<ASCII code of the special char>)) = 0
or the condition can be like this:
where
ngram not like '%'||CHR(0)||'%' -- for null
.
.
.
and ngram not like '%'||CHR(31)||'%' -- for unit separator
and ngram not like '%'||CHR(127)||'%' -- for delete
You can find all the codes here: http://www.theasciicode.com.ar/extended-ascii-code/non-breaking-space-no-break-space-ascii-code-255.html
This matches rows where ngram contains no whitespace characters, using the \s shorthand for all whitespace characters. I only tested it by inserting a TAB into a string in a VARCHAR2 column, and that row was then excluded:
where regexp_instr(ngram, '\s') = 0;