I have a table with multiple columns, and only some of these columns need to be truncated. For example, a text field might go beyond 7 characters and needs to be reduced.
Let's say I have df:
Column A     Column B   Column C   Column D
aaaaaaaaaa   12345      abcdefg    Cell1
bbbbbbbbbb   12345      abcdefg    Cell2
cccccccccc   12345      abcdefg    Cell3
dddddddddd   12345      abcdefg    Cell4
eeeeeeeeee   12345      abcdefg    Cell5
ffffffffff   12345      abcdefg    Cell6
gggggggggg   12345      abcdefg    Cell7
I can see that Columns A and C need truncating down to 5 characters.
from pyspark.sql.functions import substring

col_to_truncate = ['Column A', 'Column C']
df = df.withColumn('Column A', substring('Column A', 1, 5)) \
       .withColumn('Column C', substring('Column C', 1, 5))
The code works, but what if I want to process lots of columns dynamically? Is my only option a for loop, or is it possible to use a list comprehension instead?
You can use select instead of withColumn, like so:
df.select(*[substring(c, 1, 5).alias(c) for c in df.columns])
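Note that this truncates every column in df. If only some columns should be shortened, the comprehension can branch on membership in col_to_truncate (a sketch using the names from the question):

from pyspark.sql.functions import substring

# Truncate only the listed columns; pass the rest through unchanged
df = df.select(*[substring(c, 1, 5).alias(c) if c in col_to_truncate else c
                 for c in df.columns])

On Spark 3.3+, df.withColumns({c: substring(c, 1, 5) for c in col_to_truncate}) should achieve the same without spelling out the untouched columns.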
I am merging two tables in Power BI where I want an exact match on one field, then a fuzzy match on a second field.
In the example below, I want an exact match on the "Key" columns in Table 1 and Table 2. In Table 2, the "Key" column is not a unique identifier and can have multiple names associated with a key, so I then want to fuzzy match on the name column. Is there a way to do this in Power BI?
Table 1

Key   Name1     info_a
1     Michael   a
2     Robert    b

Table 2

Key   Name2      info_b
1     Mike       aa
1     Andrea     cc
2     Robbie     bb
2     Michelle   dd

Result

Key   Name1     Name2    info_a   info_b
1     Michael   Mike     a        aa
2     Robert    Robbie   b        bb
I ended up using a Python script to solve this problem.
I merged Table 1 and Table 2 on the field ("Key") where an exact match was required.
Then I added this Python script:
import pandas as pd
from fuzzywuzzy import fuzz

def get_fuzz_score(
    df: pd.DataFrame, col1: str, col2: str, scorer=fuzz.token_sort_ratio
) -> pd.Series:
    """
    Parameters
    ----------
    df: pd.DataFrame
    col1: str, name of column from df
    col2: str, name of column from df
    scorer: fuzzywuzzy scorer (e.g. fuzz.ratio, fuzz.WRatio, fuzz.partial_ratio, fuzz.token_sort_ratio)

    Returns
    -------
    scores: pd.Series
    """
    scores = []
    for _, row in df.iterrows():
        if pd.isna(row[col1]) or pd.isna(row[col2]):  # no score when either name is missing
            scores.append(None)
        else:
            scores.append(scorer(row[col1], row[col2]))
    return pd.Series(scores, index=df.index)

dataset['fuzzy_score'] = get_fuzz_score(dataset, 'Name1', 'Name2', fuzz.WRatio)
dataset['MatchRank'] = dataset.groupby(['Key'])['fuzzy_score'].rank(method='first', ascending=False)
Then I could just consider the matches where MatchRank = 1.
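For example, to keep only the best match per key (a sketch, assuming the merged table is still named dataset as above):

best_matches = dataset[dataset['MatchRank'] == 1]  # one row per Key: the highest fuzzy score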
I am trying to make separate columns in my query result for values stored in a single column. It is a string field that contains a variety of similar values stored like this:
["john"] or ["john", "jane"] or ["john", "john smith", "jane"], etc., where each of the values in quotes is a distinct value. I cannot seem to isolate just ["john"] in a way that will return john and not john smith. The john smith value would be in a separate column. Essentially, a column for each value in quotes. Note, I would like the results to not contain the quotes or the brackets.
I started with:
Select name
From namestbl
Where name like '%["john"]%';
I think this is heading in the wrong direction. I think this should be in select instead of where.
Sorry about the format; I gave up trying to figure out the completely useless error message when I tried to save this with table markdown.
Your data examples represent valid JSON array syntax, so cast them to jsonb and access individual elements by position (JSON arrays are zero-based). The t CTE below mimics the real data; in the illustration the number of columns is limited to 6.
with t(s) as
(
values
('["john", "john smith", "jane"]'),
('["john", "jane"]'),
('["john"]')
)
select s::jsonb->>0 name1, s::jsonb->>1 name2, s::jsonb->>2 name3,
s::jsonb->>3 name4, s::jsonb->>4 name5, s::jsonb->>5 name6
from t;
Here is the result.
name1   name2        name3   name4   name5   name6
john    john smith   jane
john    jane
john
I have a tab-delimited text file which contains some data organised into columns, with the first row acting as column names, such as:
TN Stim Task RT
1 A A 500.2
2 B A 569
3 C A 654
and so on.
I am trying to read this text file into MATLAB (R2018a) using readtable with
Data1 = readtable(filename);
I manage to get all the data into the Data1 table, but the column names show as Var1, Var2, etc. If I use name-value pairs to specify that the first row should be read as column names, as in:
Data1 = readtable(filename, 'ReadVariableNames', true);
then I get the column names as the first data row, i.e.
1 A A 500.2
So it just looks like it is ignoring the first row completely. How can I modify the readtable call to use the entries on the first row as column names?
I figured it out. It appears there was an additional tab in some of the rows after the last column. Because of this, readtable was reading it as an additional column, but did not have a column name to assign to it. It seems that if any of the column names are missing, it names them all as Var1, Var2, etc.
Based on the way your sample file text is formatted above, it appears that the column labels are separated by spaces instead of by tabs the way the data is. In this case, readtable will assume (based on the data) that the delimiter is a tab and treat the column labels as a header line to skip. Add tabs between them and you should be good to go.
Test with spaces between column labels:
% File contents:
TN Stim Task RT
1 A A 500.2
2 B A 569
3 C A 654
>> Data1 = readtable('sample_table.txt')
Data1 =
Var1 Var2 Var3 Var4 % Default names
____ ____ ____ _____
1 'A' 'A' 500.2
2 'B' 'A' 569
3 'C' 'A' 654
Test with tabs between column labels:
% File contents:
TN Stim Task RT
1 A A 500.2
2 B A 569
3 C A 654
>> Data1 = readtable('sample_table.txt')
Data1 =
TN Stim Task RT
__ ____ ____ _____
1 'A' 'A' 500.2
2 'B' 'A' 569
3 'C' 'A' 654
I have two scenario based questions here.
Question 1
Input Dataset
Col1
A
A
B
C
C
B
D
A
C
Output Dataset
Col1 Col2
A 1
A 2
A 3
B 1
B 2
C 1
C 2
C 3
D 1
Question 2
Input data string
AA-BB-CC-DD-EE-FF (the delimiter can be anything and the string can be any length)
Output data string
string 1 -> AA
string 2 -> BB
string 3 -> CC
string 4 -> DD
Thanks & Regards,
Subhasree
Question 1: This can be solved with a Transformer stage. Sort the data and use the LastRowInGroup() functionality.
For Col2, create a counter as a stage variable, add 1 for each row, and reset it with a second stage variable when LastRowInGroup() is reached.
Alternatively, you could use a ROW_NUMBER() column in SQL.
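To illustrate the same counter logic outside DataStage, here is a minimal pandas sketch (not the DataStage solution itself; column names follow the question):

import pandas as pd

df = pd.DataFrame({'Col1': list('AABCCBDAC')})  # input dataset from the question
df = df.sort_values('Col1', kind='stable').reset_index(drop=True)  # sort so groups are contiguous
df['Col2'] = df.groupby('Col1').cumcount() + 1  # running counter that restarts for each new group
print(df)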
Question 2: You have not provided enough information. Is string1 a column or a row? If you do not know anything up front about the structure (any delimiter), this will get hard.
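That said, if the delimiter were known up front, the split itself would be simple; a Python sketch assuming '-' as in the example:

s = 'AA-BB-CC-DD-EE-FF'
for i, part in enumerate(s.split('-'), start=1):  # split on the assumed delimiter
    print(f'string {i} -> {part}')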
I'm trying to find a way to use Perl to further process a PostgreSQL output. If there's a better way to do this via PostgreSQL, please let me know. I basically need to take certain columns (Realtime, Value) and concatenate their values into a single row per group while keeping ID and CAT.
First time posting, so please let me know if I missed anything.
Input:
ID CAT Realtime Value
A 1 time1 55
A 1 time2 57
B 1 time3 75
C 2 time4 60
C 3 time5 66
C 3 time6 67
Output:
ID CAT Time Values
A 1 time1,time2 55,57
B 1 time3 75
C 2 time4 60
C 3 time5,time6 66,67
You could do this most simply in Postgres like so (using array columns):
CREATE TEMP TABLE output AS SELECT
id, cat, ARRAY_AGG(realtime) as time, ARRAY_AGG(value) as values
FROM input GROUP BY id, cat;
Then select whatever you want out of the output table.
SELECT id
, cat
, string_agg(realtime, ',') AS realtimes
     , string_agg(value::text, ',') AS values -- cast covers a numeric value column
FROM input
GROUP BY 1, 2
ORDER BY 1, 2;
string_agg() requires PostgreSQL 9.0 or later and concatenates all values into a delimiter-separated string, while array_agg() (v8.4+) creates an array out of the input values.
About the 1, 2 - I quote the manual on the SELECT command:
GROUP BY clause
expression can be an input column name, or the name or ordinal number
of an output column (SELECT list item), or ...
ORDER BY clause
Each expression can be the name or ordinal number of an output column
(SELECT list item), or
Emphasis mine. So that's just notational convenience. Especially handy with complex expressions in the SELECT list.