How to change the first letter of column names to upper case in Polars? - python-polars

I need to change the first letter of all column names to upper case. Is there a quick way to do it? I've found only rename() and prefix() functions.

Alternative solution, which doesn't involve assigning to a property:
df = df.rename({col: col.capitalize() for col in df.columns})

You can use df.columns (like in pandas):
df.columns = [col.capitalize() for col in df.columns]

I've found another one:
df.select(
    pl.all().map_alias(lambda col_name: col_name.capitalize())
)
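For completeness, here is a minimal runnable sketch of the rename-based approach end to end (the sample DataFrame is made up for illustration):
import polars as pl

df = pl.DataFrame({"name": ["Ann", "Bob"], "age": [30, 25]})  # hypothetical sample data
df = df.rename({col: col.capitalize() for col in df.columns})
print(df.columns)  # ['Name', 'Age']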

Related

How to filter dataframe columns that start with something and end with something

I currently have this piece of code that works as intended:
val rules_list = df.columns.filter(_.startsWith("rule")).toList
However, this is including some columns that I don't want. How would I add a second filter so that the total filter is "columns that start with 'rule' and end with any integer value"?
So it should return "rule_1" in the list of columns, but not "rule_1_modified".
Thanks and have a great day!
You can simply add a regex condition to your filter:
val rules_list = df.columns.filter(c => c.startsWith("rule") && c.matches("^.*\\d$")).toList
You can use Python's re module like this:
import re

rules_list = []
for col_name in df.columns:
    # keep the name only if the whole thing starts with "rule" and ends with a digit
    if re.fullmatch(r"rule.*\d", col_name):
        rules_list.append(col_name)
print(rules_list)
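As a quick sanity check outside Spark, here is a self-contained sketch (the column names are made up) showing that the anchored match keeps "rule_1" but drops "rule_1_modified":
import re

columns = ["rule_1", "rule_12", "rule_1_modified", "other_col"]  # hypothetical column names
rules_list = [c for c in columns if re.fullmatch(r"rule.*\d", c)]
print(rules_list)  # ['rule_1', 'rule_12']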

PySpark - ValueError: Cannot convert column into bool

So I've seen this solution:
ValueError: Cannot convert column into bool
which I think has the solution. But I'm trying to make it work with my dataframe and can't figure out how to implement it.
My original code:
if df2['DayOfWeek'] >= 6:
    df2['WeekendOrHol'] = 1
this gives me the error:
Cannot convert column into bool: please use '&' for 'and', '|' for
'or', '~' for 'not' when building DataFrame boolean expressions.
So based on the above link I tried:
from pyspark.sql.functions import when
when((df2['DayOfWeek']>=6),df2['WeekendOrHol'] = 1)
when(df2['DayOfWeek']>=6,df2['WeekendOrHol'] = 1)
but this is incorrect as it gives me an error too.
To update a column based on a condition you need to use when like this:
from pyspark.sql import functions as F

# Update the `WeekendOrHol` column: when `DayOfWeek` >= 6, set it to 1;
# otherwise keep its current value (or do something else here).
# If no otherwise() is given, non-matching rows are set to null.
df2 = df2.withColumn('WeekendOrHol',
    F.when(
        F.col('DayOfWeek') >= 6, F.lit(1)
    ).otherwise(F.col('WeekendOrHol'))
)
Hope this helps, good luck!
Best answer as provided by pault:
df2 = df2.withColumn("WeekendOrHol", (df2["DayOfWeek"] >= 6).cast("int"))
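As a quick check, here is a self-contained sketch showing that the two approaches agree when WeekendOrHol starts at 0 (the SparkSession setup and sample data are assumptions for illustration):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df2 = spark.createDataFrame([(1, 0), (5, 0), (6, 0), (7, 0)], ["DayOfWeek", "WeekendOrHol"])

via_when = df2.withColumn("WeekendOrHol",
    F.when(F.col("DayOfWeek") >= 6, F.lit(1)).otherwise(F.col("WeekendOrHol")))
via_cast = df2.withColumn("WeekendOrHol", (df2["DayOfWeek"] >= 6).cast("int"))

via_when.show()  # rows with DayOfWeek 6 and 7 get WeekendOrHol = 1
via_cast.show()  # same result here, because WeekendOrHol starts at 0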
This is a duplicate of:
this

PySpark list() in withColumn() only works once, then AssertionError: col should be Column

I have a DataFrame with 6 string columns named like 'Spclty1'...'Spclty6' and another 6 named like 'StartDt1'...'StartDt6'. I want to zip them and collapse them into a column that looks like this:
[[Spclty1, StartDt1]...[Spclty6, StartDt6]]
I first tried collapsing just the 'Spclty' columns into a list like this:
DF = DF.withColumn('Spclty', list(DF.select('Spclty1', 'Spclty2', 'Spclty3', 'Spclty4', 'Spclty5', 'Spclty6')))
This worked the first time I executed it, giving me a new column called 'Spclty' containing rows such as ['014', '124', '547', '000', '000', '000'], as expected.
Then, I added a line to my script to do the same thing on a different set of 6 string columns, named 'StartDt1'...'StartDt6':
DF = DF.withColumn('StartDt', list(DF.select('StartDt1', 'StartDt2', 'StartDt3', 'StartDt4', 'StartDt5', 'StartDt6')))
This caused AssertionError: col should be Column.
After I ran out of things to try, I tried the original operation again (as a sanity check):
DF.withColumn('Spclty', list(DF.select('Spclty1', 'Spclty2', 'Spclty3', 'Spclty4', 'Spclty5', 'Spclty6'))).collect()
and got the assertion error as above.
So, it would be good to understand why it only worked the first time, but the main question is: what is the correct way to zip columns into a collection of dict-like elements in Spark?
.withColumn() expects a Column object as its second parameter, but you are supplying a list.
Thanks. After reading a number of SO posts I figured out the syntax for passing a set of columns to the col parameter, using struct to create an output column that holds a list of values:
from pyspark.sql.functions import array, col, struct

DF_tmp = DF_tmp.withColumn('specialties', array([
    struct(
        *(col("Spclty{}".format(i)).alias("spclty_code"),
          col("StartDt{}".format(i)).alias("start_date"))
    )
    for i in range(1, 7)
]))
So, the col() and *col() constructs are what I was looking for, while the array([struct(...)]) approach lets me combine the 'Spclty' and 'StartDt' entries into a list of dict-like elements.
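For illustration, here is a minimal self-contained sketch of what the resulting array-of-structs column looks like (the SparkSession and the two-pair sample data are assumptions; the real code loops over six pairs):
from pyspark.sql import SparkSession
from pyspark.sql.functions import array, col, struct

spark = SparkSession.builder.getOrCreate()
DF_tmp = spark.createDataFrame(
    [("014", "124", "2001-01-01", "2002-01-01")],
    ["Spclty1", "Spclty2", "StartDt1", "StartDt2"])

DF_tmp = DF_tmp.withColumn("specialties", array([
    struct(col("Spclty{}".format(i)).alias("spclty_code"),
           col("StartDt{}".format(i)).alias("start_date"))
    for i in range(1, 3)  # only two pairs in this toy example
]))
DF_tmp.select("specialties").show(truncate=False)
# one row: [{014, 2001-01-01}, {124, 2002-01-01}] (exact formatting varies by Spark version)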

Map Triples to IDs/numbers using ZipWithIndex/ZipWithUniqueID

I asked this question before, but it was unclear, so I've added more explanation to make it clearer and to get help.
replace strings with ZipWithIndex/ZipWithUniqueID
I am trying to map strings to numbers using zipWithIndex or zipWithUniqueId.
Let's say I have this format:
("u1",("name", "John Sam"))
("u2",("age", "twinty Four"))
("u3",("name", "sam Blake"))
I want this result
(0,(3,4))
(1,(5,6))
(2,(3,8))
I tried applying zipWithIndex directly to the triples, but I got each letter mapped to a number; I want to map the whole string without splitting it up.
I also tried to extract the first element of each key/value pair,
so I did:
val first = file.map(line=> line._1).distinct()
then applied zipWithIndex:
val z1 = first.zipWithIndex()
I got a result like this:
("u1",0)
("u2",1)
("u3",2)
Now I need to take the ids/numbers and substitute them back into my original file, and I need to keep all the distinct ids/numbers in a hash table so I can look them up later on.
Is there any way to do that? Any suggestions?
I hope my question is clear.
You mean something like this?
val file = List(("u1", ("name", "John Sam")),
                ("u2", ("age", "twinty Four")),
                ("u3", ("name", "sam Blake")))

val first = file.map(line => line._1) ++
  file.flatMap(line => List(line._2._1, line._2._2)).distinct

val z1: Map[String, Int] = Map[String, Int](first.zipWithIndex: _*)

file.map { l =>
  (z1(l._1),
   (z1(l._2._1), z1(l._2._2)))
}
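For comparison, the same id-assignment idea can be sketched in plain Python (a hedged sketch; the sample data just mirrors the question and the variable names are made up):
file = [("u1", ("name", "John Sam")),
        ("u2", ("age", "twinty Four")),
        ("u3", ("name", "sam Blake"))]

# give an id to each key, then to each distinct value, in order of first appearance
tokens = [k for k, _ in file] + list(dict.fromkeys(v for _, pair in file for v in pair))
z1 = {tok: i for i, tok in enumerate(tokens)}  # the lookup table to keep for later

print([(z1[k], (z1[a], z1[b])) for k, (a, b) in file])
# [(0, (3, 4)), (1, (5, 6)), (2, (3, 7))]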

find line number in an unstructured file in scala

Hi guys, I am parsing an unstructured file for some keywords, but I can't seem to easily find the line numbers of the results I am getting.
val filePath: String = "myfile"
val myfile = sc.textFile(filePath)
var ora_temp = myfile.filter(line => line.contains("MyPattern")).collect
ora_temp.length
However, I not only want to find the lines that contain MyPattern, I want something more like a tuple of (MyPattern line, line number).
Thanks in advance,
You can use zipWithIndex as eliasah pointed out in a comment (probably the most succinct way to do this, using the direct tuple accessor syntax), or use pattern matching in the filter like so:
val matchingLineAndLineNumberTuples = sc.textFile("myfile").zipWithIndex().filter({
  case (line, lineNumber) => line.contains("MyPattern")
}).collect
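For readers working in PySpark rather than Scala, the same idea translates directly (a hedged sketch, assuming an existing SparkContext named sc and the same hypothetical file path):
# zipWithIndex pairs each line with its 0-based line number;
# keep only the pairs whose line contains the pattern
matching = (sc.textFile("myfile")
              .zipWithIndex()
              .filter(lambda pair: "MyPattern" in pair[0])
              .collect())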