How to use column names with spaces in the stack function in PySpark

I tried to unpivot the dataframe, which has the following structure:
fstcol | col 1 | col 2
One    | 1     | 4
one    | 2     | 5
One    | 3     | 6
And I want the dataframe to look like this:
fstcol | col_name | value
One    | col 1    | 1
one    | col 1    | 2
One    | col 1    | 3
One    | col 2    | 4
one    | col 2    | 5
One    | col 2    | 6
I have written the following code to transform it:
df.selectExpr("fstcol","stack(2, 'col 1', col 1, 'col 2', col 2)")
However, I am getting an error because the column names contain spaces; it is unable to resolve the column values for 'col 1' and 'col 2'.
Can anyone help me to resolve this?

You can use backticks, like below:
df.selectExpr("fstcol","stack(2, 'col 1', `col 1`, 'col 2', `col 2`)")

You must use backticks:
df.selectExpr("fstcol", "stack(2, 'col 1', `col 1`, 'col 2', `col 2`) as (col_name, value)")

Related

Truncate multiple columns in PySpark Python

I have a table with multiple columns and only some of these columns need to be truncated down.
For example, a text field might go beyond 7 characters and needs to be reduced.
Let's say I have df:
Column A   | Column B | Column C | Column D
aaaaaaaaaa | 12345    | abcdefg  | Cell1
bbbbbbbbbb | 12345    | abcdefg  | Cell2
cccccccccc | 12345    | abcdefg  | Cell3
dddddddddd | 12345    | abcdefg  | Cell4
eeeeeeeeee | 12345    | abcdefg  | Cell5
ffffffffff | 12345    | abcdefg  | Cell6
gggggggggg | 12345    | abcdefg  | Cell7
I can see that Columns A and C need truncating down to 5 characters.
col_to_truncate = ['Column A', 'Column C']
df.withColumn('Column A', substring('Column A', 1, 5)).withColumn('Column C', substring('Column C', 1, 5))
The code will work, but what if I want to process lots of columns dynamically? Is my only option a for loop, or is it possible to use a list comprehension instead?
You can use select instead of withColumn like so:
df.select(*[substring(c, 1, 5).alias(c) for c in df.columns])
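If only some columns should be truncated, the same comprehension can branch on the column name (a sketch, assuming the col_to_truncate list from the question):

from pyspark.sql.functions import substring, col

col_to_truncate = ['Column A', 'Column C']

# Truncate only the listed columns; pass the others through unchanged
df_truncated = df.select(
    *[substring(c, 1, 5).alias(c) if c in col_to_truncate else col(c)
      for c in df.columns]
)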

How to join two columns of the same table and count the number of rows

Need help with the below scenario in PostgreSQL.
I need to match column A & column B (and the reverse, column B & column A) in Table 1 and count the number of wires for each device pair.
Table 1
Wire   | Device From (Column A) | Device To (Column B) | Level
Wire 1 | Device 1               | Device 2             | level 1
Wire 2 | Device 1               | Device 2             | level 1
Wire 3 | Device 2               | Device 1             | level 1
The output should look like this:
No of Wires | Wires                  | Device From | Device To | Level
3           | Wire 1, Wire 2, Wire 3 | Device 1    | Device 2  | level 1
To solve your problem in a compact way, I converted your device_from and device_to columns to numbers (integers):
CREATE TABLE wiretable (
    wire text,
    device_from integer,
    device_to integer,
    level text
);

INSERT INTO wiretable (wire, device_from, device_to, level)
VALUES ('Wire 1', 1, 2, 'level 1'),
       ('Wire 2', 1, 2, 'level 1'),
       ('Wire 3', 2, 1, 'level 1');
You need a way of comparing Device 1 and Device 2 so that (1,2) is treated the same as (2,1). The only way I could think of to do that was by putting both values in an array and sorting it. But to sort an array, you have to install the intarray extension (CREATE EXTENSION intarray;) or do some tricks. See this question for how to sort arrays and/or install the extension.
Basically you want to GROUP BY the two columns as an unordered pair. I couldn't find anything like that in the documentation, so I put both columns in an array and GROUP BY that array. It takes two steps.
SELECT count(*), string_agg(wire, ', '),
       sort(array[device_from, device_to]) AS combo,
       level
FROM wiretable
GROUP BY combo, level;
count | string_agg             | combo | level
3     | Wire 1, Wire 2, Wire 3 | {1,2} | level 1
That looks pretty good, but now we have to split combo back into two columns. We do this with a CTE:
WITH foo AS (
    SELECT count(*), string_agg(wire, ', '),
           sort(array[device_from, device_to]) AS combo,
           level
    FROM wiretable
    GROUP BY combo, level
)
SELECT foo.count, foo.string_agg AS wires,
       foo.combo[1] AS "Device From",
       foo.combo[2] AS "Device To",
       foo.level
FROM foo
which gives
count | wires                  | Device From | Device To | level
3     | Wire 1, Wire 2, Wire 3 | 1           | 2         | level 1
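An alternative sketch that avoids the intarray extension entirely is to normalise each pair with LEAST/GREATEST (not from the original answer, just another option against the same wiretable):

SELECT count(*) AS count,
       string_agg(wire, ', ') AS wires,
       LEAST(device_from, device_to)    AS "Device From",
       GREATEST(device_from, device_to) AS "Device To",
       level
FROM wiretable
GROUP BY LEAST(device_from, device_to),
         GREATEST(device_from, device_to),
         level;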

Casting string like "[1, 2, 3]" to array [duplicate]

This question already has an answer here:
Handle string to array conversion in pyspark dataframe
(1 answer)
Closed 4 years ago.
Pretty straightforward. I have an array-like column encoded as a string (varchar) and want to cast it to array (so I can then explode it and manipulate the elements in "long" format).
The two most natural approaches don't seem to work:
-- just returns a length-1 array with a single string element '[1, 2, 3]'
select array('[1, 2, 3]')
-- errors: DataType array is not supported.
select cast('[1, 2, 3]' as array)
The ugly/inelegant/circuitous way to get what I want is:
select explode(split(replace(replace('[1, 2, 3]', '['), ']'), ', '))
-- '1'
-- '2'
-- '3'
(regexp_replace could subsume the two replace calls, but regexes with square brackets are always a pain; ltrim and rtrim, or trim(BOTH '[]' ...), could also be used)
Is there any more concise way to go about this? I'm on Spark 2.3.1.
I am assuming here that the elements are single digits, but you get the idea:
>>> s = '[1,2,3]'
>>> list(c for c in s if c.isdigit())
['1', '2', '3']
>>> map(int, list(c for c in s if c.isdigit()))
[1, 2, 3]
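A DataFrame-side sketch that stays in Spark (my own suggestion, not from the answer above; it assumes a column named s holding strings like '[1, 2, 3]', strips the brackets, splits on the comma, then casts the resulting array<string> to array<int>):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("[1, 2, 3]",)], ["s"])

# Remove brackets and spaces, split on commas, cast the elements to int
df = df.withColumn(
    "arr",
    F.split(F.regexp_replace("s", r"[\[\] ]", ""), ",").cast("array<int>")
)
df.select(F.explode("arr")).show()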

PySpark: How to concatenate two dataframes without duplicates rows?

I'd like to concatenate two dataframes A and B into a new one without duplicate rows (if a row in B already exists in A, don't add it):
Dataframe A:
A B
0 1 2
1 3 1
Dataframe B:
A B
0 5 6
1 3 1
I wish to merge them such that the final DataFrame is of the following shape:
Final Dataframe:
A B
0 1 2
1 3 1
2 5 6
How can I do this?
pyspark.sql.DataFrame.union and pyspark.sql.DataFrame.unionAll seem to yield the same result with duplicates.
Instead, you can get the desired output by using direct SQL:
dfA.createTempView('dataframea')
dfB.createTempView('dataframeb')
aunionb = spark.sql('select * from dataframea union select * from dataframeb')
Using SQL produces the expected/correct result.
In order to remove any duplicate rows, just use union() followed by a distinct().
As mentioned in the documentation (http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html):
"union(other)
Return a new DataFrame containing union of rows in this frame and another frame.
This is equivalent to UNION ALL in SQL. To do a SQL-style set union (that does deduplication of elements), use this function followed by a distinct."
You just have to drop duplicates after the union:
df = dfA.union(dfB).dropDuplicates()
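A minimal self-contained sketch of the union-then-deduplicate approach (assuming the example frames above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

dfA = spark.createDataFrame([(1, 2), (3, 1)], ["A", "B"])
dfB = spark.createDataFrame([(5, 6), (3, 1)], ["A", "B"])

# union keeps duplicates (UNION ALL semantics);
# distinct() (or equivalently dropDuplicates()) removes them
result = dfA.union(dfB).distinct()
result.show()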

SQL basic full-text search

I have not worked much with T-SQL or the full-text search feature of SQL Server, so bear with me.
I have a table with an nvarchar column (Col) like this:
Col ... more columns
Row 1: '1'
Row 2: '1|2'
Row 3: '2|40'
I want to do a search to match similar users. So if I have a user that has a Col value of '1' I would expect the search to return the first two rows. If I had a user with a Col value of '1|2' I would expect to get Row 2 returned first and then Row 1. If I try to match users with a Col value of '4' I wouldn't get any results. I thought of doing a 'contains' by splitting the value I am using to query but it wouldn't work since '2|40' contains 4...
I looked up the documentation on using the 'FREETEXT' keyword but I don't think that would work for me since I essentially need to break up the Col values into words using the '|' as a break.
Thanks,
John
You should not store values like '1|2' in a single field to hold 2 values. If you have a maximum of 2 values, you should use 2 fields to store them. If you can have 0-to-many values, you should store them in a new table with a foreign key pointing to the primary key of your table.
If you only have a max of 2 values in your table, you can find your data like this:
DECLARE #s VARCHAR(3) = '1'
SELECT *
FROM <table>
WHERE #s IN(
PARSENAME(REPLACE(col, '|', '.'), 1),
PARSENAME(REPLACE(col, '|', '.'), 2)
--,PARSENAME(REPLACE(col, '|', '.'), 3) -- if col can contain 3
--,PARSENAME(REPLACE(col, '|', '.'), 4) -- or 4 values this can be used
)
PARSENAME can handle a max of 4 values. If 'col' can contain more than 4 values, use this:
DECLARE #s VARCHAR(3) = '1'
SELECT *
FROM <table>
WHERE '|' + col + '|' like '%|' + #s + '|%'
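If you are on SQL Server 2016 or later, STRING_SPLIT is another option that handles any number of values per row (a sketch; the table name is a placeholder as above):

DECLARE @s VARCHAR(3) = '1'

-- Split each row's Col on '|' and keep rows where any piece matches exactly
SELECT DISTINCT t.*
FROM <table> AS t
CROSS APPLY STRING_SPLIT(t.Col, '|') AS parts
WHERE parts.value = @s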
You need to mix this in with a CASE for when there is no '|', but this returns the left- and right-hand sides:
select left('2|10', CHARINDEX('|', '2|10') - 1)
select right('2|10', LEN('2|10') - CHARINDEX('|', '2|10'))