Looking at PySpark, I see translate and regexp_replace, which help me replace a single character that exists in a DataFrame column.
I was wondering if there is a way to supply multiple strings to regexp_replace or translate so that it would parse them and replace them with something else.
Use case: remove all $, #, and comma(,) in a column A
You can use pyspark.sql.functions.translate() to make multiple replacements. Pass in a string of letters to replace and another string of equal length which represents the replacement values.
For example, let's say you had the following DataFrame:
import pyspark.sql.functions as f
df = sqlCtx.createDataFrame([("$100,00",),("#foobar",),("foo, bar, #, and $",)], ["A"])
df.show()
#+------------------+
#| A|
#+------------------+
#| $100,00|
#| #foobar|
#|foo, bar, #, and $|
#+------------------+
and wanted to replace ('$', '#', ',') with ('X', 'Y', 'Z'). Simply use translate like:
df.select("A", f.translate(f.col("A"), "$#,", "XYZ").alias("replaced")).show()
#+------------------+------------------+
#| A| replaced|
#+------------------+------------------+
#| $100,00| X100Z00|
#| #foobar| Yfoobar|
#|foo, bar, #, and $|fooZ barZ YZ and X|
#+------------------+------------------+
If instead you wanted to remove all instances of ('$', '#', ','), you could do this with pyspark.sql.functions.regexp_replace().
df.select("A", f.regexp_replace(f.col("A"), "[\$#,]", "").alias("replaced")).show()
#+------------------+-------------+
#| A| replaced|
#+------------------+-------------+
#| $100,00| 10000|
#| #foobar| foobar|
#|foo, bar, #, and $|foo bar and |
#+------------------+-------------+
The pattern "[\$#,]" means match any of the characters inside the brackets. The $ is escaped because it has a special meaning in regex (inside a character class the escape is optional, but it does no harm).
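As a side note, you may want to write the pattern as a Python raw string so that newer Python versions don't warn about the \$ escape sequence; the regex itself is unchanged:
import pyspark.sql.functions as f

# Same character class as above; the raw string keeps Python from interpreting the backslash
df.select("A", f.regexp_replace("A", r"[\$#,]", "").alias("replaced")).show()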
If someone needs to do this in Scala, it can be done as in the code below:
val df = Seq(("Test$",19),("$#,",23),("Y#a",20),("ZZZ,,",21)).toDF("Name","age")
import org.apache.spark.sql.functions._
val df1 = df.withColumn("NewName",translate($"Name","$#,","xyz"))
display(df1) // display() is Databricks-specific; use df1.show() outside Databricks
With df1.show() the output looks like this:
+-----+---+-------+
| Name|age|NewName|
+-----+---+-------+
|Test$| 19|  Testx|
|  $#,| 23|    xyz|
|  Y#a| 20|    Yya|
|ZZZ,,| 21|  ZZZzz|
+-----+---+-------+
I am leaving my question below as it was originally posted, for the sake of future developers who run into this problem. The issue was resolved once I moved to Spark 2.0, i.e. the output was as I expected without making any changes to my original code. It looks like there is some implementation difference in the 1.6 version I used at first.
I have Spark 1.6 Scala code that reads a TSV (CSV with tab delimiter) and writes it to TSV output (without changing the input - just filtering the input).
The input data sometimes has null values in the last columns of a row.
When I use the delimiter "," the output has trailing commas.
E.g.
val1, val2, val3,val4,val5
val1, val2, val3,,
but if I use tab (\t) as a delimiter, the output does not include the trailing tabs. E.g. (I am writing TAB where \t appears):
val1 TAB val2 TAB val3 TAB val4 TAB val5
val1 TAB val2 TAB val3 <= **here I expected two more tabs (as with the comma delimiter)**
I also tried other delimiters and saw that when the delimiter is a whitespace character (e.g. the ' ' character) the trailing delimiters are not in the output.
If I use another visible delimiter (e.g. the letter 'z') it works fine, as with the comma separator, and I get trailing delimiters.
I thought this might have to do with the options ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace but setting them both to false when writing didn't help either.
My code looks like this:
val df = sqlContext.read.format("com.databricks.spark.csv").option("delimiter", "\t").load(params.inputPathS3)
val df_filtered = df.filter(...)
df_filtered.write.format("com.databricks.spark.csv").option("delimiter", "\t").save(outputPath)
I also tried (as I wrote above):
df_filtered.write.format("com.databricks.spark.csv").option("delimiter", "\t").option("ignoreLeadingWhiteSpace", "false").option("ignoreTrailingWhiteSpace", "false").save(outputPath)
Below is a working example (with Spark 1.6):
Input file (with some trailing spaces at the end):
1,2,3,,
scala> val df = sqlContext.read.option("ignoreLeadingWhiteSpace", "false").option("ignoreTrailingWhiteSpace", "false").format("com.databricks.spark.csv").option("delimiter", ",").load("path")
df: org.apache.spark.sql.DataFrame = [C0: string, C1: string, C2: string, C3: string, C4: string]
scala> df.show
+---+---+---+---+---+
| C0| C1| C2| C3| C4|
+---+---+---+---+---+
| 1| 2| 3| | |
+---+---+---+---+---+
scala> df.write.option("nullValue", "null").option("quoteMode", "ALL").mode("overwrite").format("com.databricks.spark.csv").option("delimiter", "\t").save("path")
scala> sqlContext.read.format("com.databricks.spark.csv").option("delimiter", "\t").load("path").show
+---+---+---+---+---+
| C0| C1| C2| C3| C4|
+---+---+---+---+---+
| 1| 2| 3| | |
+---+---+---+---+---+
Please refer to the databricks spark-csv documentation for all of the options available when reading and writing with the library.
I have a dataframe, where some column special_column contains values like one, two. My dataframe also has columns one_processed and two_processed.
I would like to add a new column my_new_column whose values are taken from other columns of my DataFrame, based on the processed values from special_column. For example, if special_column == one I would like my_new_column to be set to one_processed.
I tried .withColumn("my_new_column", F.col(F.concat(F.col("special_column"), F.lit("_processed")))), but Spark complains that I cannot parametrize F.col with a column.
How could I get the string value of the concatenation, so that I can select the desired column?
from pyspark.sql.functions import when, col, lit, concat_ws
sdf.withColumn("my_new_column",
    when(col("special_column") == "one", col("one_processed"))
    .otherwise(concat_ws("_", col("special_column"), lit("processed"))))
# note: the otherwise branch yields the concatenated name as a string (e.g. "two_processed"),
# not the value of that column
The easiest way in your case would be just a simple when/otherwise, like:
>>> df = spark.createDataFrame([(1, 2, "one"), (1,2,"two")], ["one_processed", "two_processed", "special_column"])
>>> df.withColumn("my_new_column", F.when(F.col("special_column") == "one", F.col("one_processed")).otherwise(F.col("two_processed"))).show()
+-------------+-------------+--------------+-------------+
|one_processed|two_processed|special_column|my_new_column|
+-------------+-------------+--------------+-------------+
| 1| 2| one| 1|
| 1| 2| two| 2|
+-------------+-------------+--------------+-------------+
As far as I know, there is no way to select a column dynamically by a name computed from the data, since the execution plan cannot depend on the data itself.
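If there are more than two *_processed columns, the when/otherwise chain can be built programmatically. A minimal sketch, assuming the possible values of special_column (here "one" and "two") are known up front:
import functools
import pyspark.sql.functions as F

# Hypothetical list of values that special_column can take; each maps to "<value>_processed"
candidates = ["one", "two"]

# Builds when(special_column == "one", one_processed).when(special_column == "two", two_processed)...
expr = functools.reduce(
    lambda acc, name: acc.when(F.col("special_column") == name, F.col(name + "_processed")),
    candidates[1:],
    F.when(F.col("special_column") == candidates[0], F.col(candidates[0] + "_processed")),
)
df.withColumn("my_new_column", expr).show()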
I am building a join condition in my PySpark application using the substring function. The function returns a Column type instead of a value.
substring(trim(coalesce(df.col1)), 13, 3) returns
Column<b'substring(trim(coalesce(col1), 13, 3)'>
I tried with expr but still get the same Column type result:
expr("substring(trim(coalesce(df.col1)),length(trim(coalesce(df.col1))) - 2, 3)")
I want to compare the values coming from substring to the value of another dataframe column. Both are of string type
pyspark:
substring(trim(coalesce(df.col1)), length(trim(coalesce(df.col1))) -2, 3) == df2["col2"]
Let's say col1 = 'abcdefghijklmno'.
The expected output of the substring function should be 'mno', based on the above definition.
Creating sample DataFrames to join:
list1 = [('ABC','abcdefghijklmno'),('XYZ','abcdefghijklmno'),('DEF','abcdefghijklabc')]
df1=spark.createDataFrame(list1, ['col1', 'col2'])
list2 = [(1,'mno'),(2,'mno'),(3,'abc')]
df2=spark.createDataFrame(list2, ['col1', 'col2'])
import pyspark.sql.functions as f
Creating a substring condition that reads the last three characters:
cond=f.substring(df1['col2'], -3, 3)==df2['col2']
newdf=df1.join(df2,cond)
>>> newdf.show()
+----+---------------+----+----+
|col1| col2|col1|col2|
+----+---------------+----+----+
| ABC|abcdefghijklmno| 1| mno|
| ABC|abcdefghijklmno| 2| mno|
| XYZ|abcdefghijklmno| 1| mno|
| XYZ|abcdefghijklmno| 2| mno|
| DEF|abcdefghijklabc| 3| abc|
+----+---------------+----+----+
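For completeness, the length-based form from the question also works if you use Column.substr, which accepts either ints or column expressions for the position and length. A sketch, equivalent to the condition above:
# Both arguments must then be columns, hence f.lit(3)
cond2 = df1["col2"].substr(f.length(df1["col2"]) - 2, f.lit(3)) == df2["col2"]
df1.join(df2, cond2).show()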
I have a pyspark DataFrame
a = [
('Bob', 562),
('Bob',880),
('Bob',380),
('Sue',85),
('Sue',963)
]
df = spark.createDataFrame(a, ["Person", "Amount"])
I need to create a column that hashes Amount and returns the hash. The problem is that I can't use a UDF, so I have used a mapping function.
df.rdd.map(lambda x: hash(x["Amount"]))
If you can't use udf you can use the map function, but as you've currently written it, there will only be one column. To keep all the columns, do the following:
df = df.rdd\
.map(lambda x: (x["Person"], x["Amount"], hash(str(x["Amount"]))))\
.toDF(["Person", "Amount", "Hash"])
df.show()
#+------+------+--------------------+
#|Person|Amount| Hash|
#+------+------+--------------------+
#| Bob| 562|-4340709941618811062|
#| Bob| 880|-7718876479167384701|
#| Bob| 380|-2088598916611095344|
#| Sue| 85| 7168043064064671|
#| Sue| 963|-8844931991662242457|
#+------+------+--------------------+
Note: In this case, hash(x["Amount"]) is not very interesting (the hash of a small Python int is the int itself), so I changed it to hash Amount converted to a string.
Essentially you have to map the row to a tuple containing all of the existing columns and add in the new column(s).
If your columns are too many to enumerate, you could also just add a tuple to the existing row.
df = df.rdd\
.map(lambda x: x + (hash(str(x["Amount"])),))\
.toDF(df.columns + ["Hash"])
I should also point out that if hashing the values is your end goal, there is also a pyspark function pyspark.sql.functions.hash that can be used to avoid the serialization to rdd:
import pyspark.sql.functions as f
df.withColumn("Hash", f.hash("Amount")).show()
#+------+------+----------+
#|Person|Amount| Hash|
#+------+------+----------+
#| Bob| 562| 51343841|
#| Bob| 880|1241753636|
#| Bob| 380| 514174926|
#| Sue| 85|1944150283|
#| Sue| 963|1665082423|
#+------+------+----------+
This appears to use a different hashing algorithm than the python builtin.
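Also worth noting: Python's built-in hash of strings can be randomized between interpreter runs unless PYTHONHASHSEED is fixed, so if you need a hash that is reproducible across jobs, a cryptographic column function such as pyspark.sql.functions.sha2 may be a better fit. A minimal sketch (an alternative, not the approach asked about):
import pyspark.sql.functions as f

# sha2 expects a string/binary column, so cast Amount first; 256 selects SHA-256
df.withColumn("Hash", f.sha2(f.col("Amount").cast("string"), 256)).show(truncate=False)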
How do you set the display precision in PySpark when calling .show()?
Consider the following example:
from math import sqrt
import pyspark.sql.functions as f
data = zip(
map(lambda x: sqrt(x), range(100, 105)),
map(lambda x: sqrt(x), range(200, 205))
)
df = sqlCtx.createDataFrame(data, ["col1", "col2"])
df.select([f.avg(c).alias(c) for c in df.columns]).show()
Which outputs:
#+------------------+------------------+
#| col1| col2|
#+------------------+------------------+
#|10.099262230352151|14.212583322380274|
#+------------------+------------------+
How can I change it so that it only displays 3 digits after the decimal point?
Desired output:
#+------+------+
#| col1| col2|
#+------+------+
#|10.099|14.213|
#+------+------+
This is the PySpark version of an analogous Scala question. I'm posting it here because I could not find an answer when searching for PySpark solutions, and I think it can be helpful to others in the future.
Round
The easiest option is to use pyspark.sql.functions.round():
from pyspark.sql.functions import avg, round
df.select([round(avg(c), 3).alias(c) for c in df.columns]).show()
#+------+------+
#| col1| col2|
#+------+------+
#|10.099|14.213|
#+------+------+
This will maintain the values as numeric types.
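You can confirm that the rounded columns keep their numeric type with a quick schema check (same select as above):
df.select([round(avg(c), 3).alias(c) for c in df.columns]).printSchema()
#root
# |-- col1: double (nullable = true)
# |-- col2: double (nullable = true)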
Format Number
The functions are the same for Scala and Python; the only difference is the import.
You can use format_number to format a number to the desired number of decimal places, as stated in the official API documentation:
Formats numeric column x to a format like '#,###,###.##', rounded to d decimal places, and returns the result as a string column.
from pyspark.sql.functions import avg, format_number
df.select([format_number(avg(c), 3).alias(c) for c in df.columns]).show()
#+------+------+
#| col1| col2|
#+------+------+
#|10.099|14.213|
#+------+------+
The transformed columns would be of StringType, and a comma is used as a thousands separator (shown here with larger example values):
#+-----------+--------------+
#| col1| col2|
#+-----------+--------------+
#|500,100.000|50,489,590.000|
#+-----------+--------------+
As stated in the Scala version of this answer, we can use regexp_replace to replace the , with any string you want:
Replace all substrings of the specified string value that match regexp with rep.
from pyspark.sql.functions import avg, format_number, regexp_replace
df.select(
[regexp_replace(format_number(avg(c), 3), ",", "").alias(c) for c in df.columns]
).show()
#+----------+------------+
#| col1| col2|
#+----------+------------+
#|500100.000|50489590.000|
#+----------+------------+
Just wrap the answer above in a function which only deals with float and double columns:
import pyspark.sql.functions as F
from pyspark.sql import DataFrame
def dataframe_format_float(df: DataFrame, num_decimals=4) -> DataFrame:
    """Round all float/double columns to num_decimals places; leave other columns untouched."""
    r = []
    for c in df.dtypes:
        name, dtype = c[0], c[1]
        if dtype in ['float', 'double']:
            r.append(F.round(name, num_decimals).alias(name))
        else:
            r.append(name)
    df = df.select(r)
    return df
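For example, applied to the aggregated DataFrame from the question (a hypothetical usage):
agg = df.select([F.avg(c).alias(c) for c in df.columns])
dataframe_format_float(agg, num_decimals=3).show()
# Same 3-decimal output as the round() example above.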