I have a PySpark DataFrame called df.
ONE LINE EXAMPLE:
df.take(1)
[Row(data=u'2016-12-25',nome=u'Mauro',day_type="SUN")]
I have a list of holiday days:
holydays=[u'2016-12-25',u'2016-12-08'....]
I want to switch day_type to "HOLIDAY" if "data" is in the holydays list; otherwise I want to leave the day_type field as it is.
This is my non-working attempt:
df=df.withColumn("day_type",when(col("data") in holydays, "HOLIDAY").otherwise(col("day_type")))
PySpark does not like the expression "in holydays".
It returns this error:
ValueError: Cannot convert column into bool: please use '&' for 'and', '|'
Regarding your first question - you need isin:
spark.version
# u'2.2.0'
from pyspark.sql import Row
from pyspark.sql.functions import col, when
df=spark.createDataFrame([Row(data=u'2016-12-25',nome=u'Mauro',day_type="SUN")])
holydays=[u'2016-12-25',u'2016-12-08']
df.withColumn("day_type",when(col("data").isin(holydays), "HOLIDAY").otherwise(col("day_type"))).show()
# +----------+--------+-----+
# | data|day_type| nome|
# +----------+--------+-----+
# |2016-12-25| HOLIDAY|Mauro|
# +----------+--------+-----+
Regarding your second question - I don't see any issue:
df.withColumn("day_type",when(col("data")=='2016-12-25', "HOLIDAY").otherwise(col("day_type"))).filter("day_type='HOLIDAY'").show()
# +----------+--------+-----+
# | data|day_type| nome|
# +----------+--------+-----+
# |2016-12-25| HOLIDAY|Mauro|
# +----------+--------+-----+
BTW, it's always a good idea to provide a little more than a single row of sample data...
Use the isin function on the column instead of the in clause to check if the value is present in a list. Sample code:
df=df.withColumn("day_type", when(df.data.isin(holydays), "HOLIDAY").otherwise(df.day_type))
Let's say I have a dataframe like -
target_id = [3733345, 3725312, 3717114, 3408996, 3354970]
test_df = spark.createDataFrame(target_id, IntegerType()).withColumnRenamed("value", "target_id")
I want to add random samples of values from this column to another column other_target_ids such that the output comes something like below:
target_id other_ids
3733345 [3731634, 3729995, 3728014, 3708332, 3720...
3725312 [3711541, 3726052, 3733763, 900056057, 371...
3717114 [3701718, 3713481, 3715433, 3714825, 3731...
3408996 [3405896, 3250400, 3237054, 3242492, 3256...
3354970 [3354969, 3347893, 3348168, 3353273, 3356...
I guess you could do that with a few steps to collect and filter ids. Check my sample below:
from pyspark.sql import functions as F
from pyspark.sql import types as T
from pyspark.sql.window import Window as W
(test_df
# collect all ids into a list
.withColumn('ids', F.collect_list('target_id').over(W.partitionBy(F.lit(1))))
# remove `target_id` from the above list
.withColumn('ids', F.array_except('ids', F.array(F.col('target_id'))))
# shuffle the ids
.withColumn('ids', F.shuffle('ids'))
# "sampling" the ids by getting the first few items
.withColumn('ids', F.slice('ids', 1, 2))
# display
.show(10, False)
)
+---------+------------------+
|target_id|ids |
+---------+------------------+
|3733345 |[3408996, 3354970]|
|3725312 |[3408996, 3733345]|
|3717114 |[3408996, 3354970]|
|3408996 |[3733345, 3354970]|
|3354970 |[3733345, 3725312]|
+---------+------------------+
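For intuition, the sampling logic above (pick random ids from the full list, excluding the row's own id) can be sketched in plain Python. This is a hypothetical standalone check, not Spark code; the helper name `sample_other_ids` is made up for illustration:

```python
import random

target_ids = [3733345, 3725312, 3717114, 3408996, 3354970]

def sample_other_ids(target_id, ids, k=2, seed=None):
    """Pick k random ids from `ids`, excluding `target_id` itself."""
    rng = random.Random(seed)
    # mirrors F.array_except: drop the row's own id from the candidate pool
    candidates = [i for i in ids if i != target_id]
    # mirrors F.shuffle + F.slice: random pick of k items
    return rng.sample(candidates, k)

for tid in target_ids:
    others = sample_other_ids(tid, target_ids, k=2, seed=42)
    assert tid not in others and len(others) == 2
```

Note that `W.partitionBy(F.lit(1))` collects every id into a single partition, so this approach only scales to modestly sized id lists.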
I'm trying to modify a column from my dataFrame by removing the suffix from all the rows under that column and I need it in Scala.
The values from the column have different lengths and also the suffix is different.
For example, I have the following values:
09E9894DB868B70EC3B55AFB49975390-0_0_0_0_0
0978C74C69E8D559A62F860EA36ADF5E-28_3_1
0C12FA1DAFA8BCD95E34EE70E0D71D10-0_3_1
0D075AA40CFC244E4B0846FA53681B4D_0_1_0_1
22AEA8C8D403643111B781FE31B047E3-0_1_0_0
I need to remove everything after the "_" so that I can get the following values:
09E9894DB868B70EC3B55AFB49975390-0
0978C74C69E8D559A62F860EA36ADF5E-28
0C12FA1DAFA8BCD95E34EE70E0D71D10-0
0D075AA40CFC244E4B0846FA53681B4D
22AEA8C8D403643111B781FE31B047E3-0
As @werner pointed out in his comment, substring_index provides a simple solution to this. It is not necessary to wrap this in a call to selectExpr.
Whereas @AminMal has provided a working solution using a UDF, if a native Spark function can be used then this is preferable for performance.[1]
import spark.implicits._

val df = List(
"09E9894DB868B70EC3B55AFB49975390-0_0_0_0_0",
"0978C74C69E8D559A62F860EA36ADF5E-28_3_1",
"0C12FA1DAFA8BCD95E34EE70E0D71D10-0_3_1",
"0D075AA40CFC244E4B0846FA53681B4D_0_1_0_1",
"22AEA8C8D403643111B781FE31B047E3-0_1_0_0"
).toDF("col0")
import org.apache.spark.sql.functions.{col, substring_index}
df
.withColumn("col0", substring_index(col("col0"), "_", 1))
.show(false)
gives:
+-----------------------------------+
|col0 |
+-----------------------------------+
|09E9894DB868B70EC3B55AFB49975390-0 |
|0978C74C69E8D559A62F860EA36ADF5E-28|
|0C12FA1DAFA8BCD95E34EE70E0D71D10-0 |
|0D075AA40CFC244E4B0846FA53681B4D |
|22AEA8C8D403643111B781FE31B047E3-0 |
+-----------------------------------+
[1] Is there a performance penalty when composing spark UDFs
I have one column in a DataFrame with the format '[{jsonobject},{jsonobject}]'; here the length will be 2.
I have to find the length of this array and store it in another column.
I've only worked with PySpark, but the Scala solution would be similar. Assuming the column name is input:
from pyspark.sql import functions as f, types as t
json_schema = t.ArrayType(t.MapType(t.StringType(), t.StringType()))
df.select(f.size(f.from_json(df.input, json_schema)).alias("num_objects"))
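Outside Spark, the same count can be cross-checked with Python's json module: from_json with an ArrayType schema parses the string into an array, and size returns its length. The sample string below is hypothetical, shaped like the '[{jsonobject},{jsonobject}]' format in the question:

```python
import json

# a sample value shaped like '[{jsonobject},{jsonobject}]'
input_value = '[{"a": "1"}, {"b": "2"}]'

# json.loads parses the string into a list of dicts,
# mirroring from_json with ArrayType(MapType(...)) in Spark
num_objects = len(json.loads(input_value))
print(num_objects)  # → 2
```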
I'm working with databricks on a notebook.
I have a column with numbers like this 103503119090884718216391506040
They are in string format. I can print them and read them easily.
For debugging purposes I need to be able to read them. However, I also need to be able to apply the .sort() method to them. Casting them as IntegerType() returns null values; casting them as double makes them unreadable.
How can I convert them to a human-readable format while .sort() still works? Do I need to create two separate columns?
To make the column sortable, you could convert your column to DecimalType(precision, scale) (https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.types.DecimalType.html#pyspark.sql.types.DecimalType). For this data type you can choose the possible value range via its two arguments.
from pyspark.sql import SparkSession, Row, types as T, functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
Row(string_column='103503119090884718216391506040'),
Row(string_column='103503119090884718216391506039'),
Row(string_column='103503119090884718216391506041'),
Row(string_column='90'),
])
(
df
.withColumn('decimal_column', F.col('string_column').cast(T.DecimalType(30,0)))
.sort('decimal_column')
.show(truncate=False)
)
# Output
+------------------------------+------------------------------+
|string_column |decimal_column |
+------------------------------+------------------------------+
|90 |90 |
|103503119090884718216391506039|103503119090884718216391506039|
|103503119090884718216391506040|103503119090884718216391506040|
|103503119090884718216391506041|103503119090884718216391506041|
+------------------------------+------------------------------+
Concerning "human readability" I'm not sure whether that helps, though.
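The underlying idea can be checked in plain Python, where arbitrary-precision ints make the comparison obvious: keep the string for display, sort by numeric value. This is a standalone sketch, not Spark code:

```python
values = [
    "103503119090884718216391506040",
    "103503119090884718216391506039",
    "90",
]

# sort the strings by their integer value (Python ints are arbitrary precision),
# mirroring what the DecimalType cast achieves in Spark
ordered = sorted(values, key=int)
print(ordered[0])  # → '90'

# for display, digit grouping can make long numbers easier to scan
print(f"{int(ordered[-1]):,}")
```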
I need to save a file delimited by "|~" characters but I get an error when I execute the command below. Can I save a file using multiple delimiters in Spark?
mydf1.coalesce(1).write.option("compression","none").format("csv").mode("Overwrite").option("delimiter","|~").save("my_hdfs_path")
# Error: pyspark.sql.utils.IllegalArgumentException: u'Delimiter cannot be more than one character: |~'
AFAIK, we are still waiting for an "official" solution, because the issue "support for multiple delimiter in Spark CSV read" is still open, and Spark SQL still relies on univocity-parsers. In univocity's CSV settings, the delimiter can only be a single character, which constrains both the parser (reader) and the generator (writer).
Workarounds
Finding a universally fast and safe way to write as CSV is hard. But depending on your data size and the complexity of the CSV contents (date formats? currency? quoting?), we may find a shortcut. The following are just some, hopefully inspiring, thoughts...
Write to CSV with a special character (say ⊢), then substitute it with |~.
(not benchmarked, but IMO it's likely to be the fastest)
df.coalesce(1).write.option("compression","none").option("delimiter", "⊢").mode("overwrite").csv("raw-output")
then post-process (ideally locally) with, say, sed:
sed -i '.bak' 's/⊢/\|~/g' raw-output/*.csv
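If sed is not available (or its -i flag behaves differently across BSD/GNU), the same substitution can be done in Python. The "raw-output" directory matches the Spark write path above; the helper name is made up for illustration:

```python
import glob

def replace_delimiter(path, old="⊢", new="|~"):
    """Rewrite a file, replacing the placeholder delimiter with the two-character one."""
    with open(path, encoding="utf-8") as f:
        content = f.read()
    with open(path, "w", encoding="utf-8") as f:
        f.write(content.replace(old, new))

# post-process every part file Spark produced
for csv_path in glob.glob("raw-output/*.csv"):
    replace_delimiter(csv_path)
```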
Within PySpark, concatenate each row into a string, then write as a text file.
(flexible enough to deal with locality and special needs, with a bit more work)
d = [{'name': 'Alice', 'age': 1},{'name':'Bob', 'age':3}]
df = spark.createDataFrame(d, "name:string, age:int")
df.show()
#+-----+---+
#| name|age|
#+-----+---+
#|Alice| 1|
#| Bob| 3|
#+-----+---+
from pyspark.sql.functions import udf

@udf("string")
def mkstr(name, age):
    """
    for example, the string field {name} should be quoted with `"`
    """
    return '"{name}"|~{age}'.format(name=name, age=age)
# unparse a CSV row back to a string
df_unparsed = df.select(mkstr("name", "age").alias("csv_row"))
df_unparsed.show()
#+----------+
#| csv_row|
#+----------+
#|"Alice"|~1|
#| "Bob"|~3|
#+----------+
df_unparsed.coalesce(1).write.option("compression", "none").mode("overwrite").text("output")
numpy.savetxt allows multiple characters as a delimiter, so...
(numpy has lots of builtins if you care about the precision of floating-point numbers)
import pandas as pd
import numpy as np
# convert `Spark.DataFrame` to `Pandas.DataFrame`
df_pd = df.toPandas()
# use `numpy.savetxt` to save `Pandas.DataFrame`
np.savetxt("a-long-day.csv", df_pd, delimiter="|~", fmt="%s")
From Spark 3.0 we don't have this issue, but if you are using an earlier version (Spark 2.3+) this can also be used as a solution: basically, concatenate all columns, fill nulls with blanks, and write the data with the desired delimiter along with the header. This is more generic than hardcoding, and it also retains the header.
from pyspark.sql.functions import *
from pyspark.sql import functions as F
target_row_delimited = "|,"
df=df.select([col(c).cast("string") for c in df.columns])
df=df.na.fill("")
headername=target_row_delimited.join(df.columns)
df=df.withColumn(headername, concat_ws(target_row_delimited, *df.columns))
df.select(headername).write.format("csv").mode(modeval).option("quoteAll", "false").option("quote", "\u0000").option("header", "true").save(tgt_path + "/")
In case we need to read with multiple delimiters, the following solution can be used:
source_delimiter = "|_|"
headcal = spark.read.text(source_filename)
headers = headcal.take(1)[0]['value']
header_column = headers.split(source_delimiter)
df = sc.textFile(source_filename).map(lambda x: x.split(source_delimiter)).toDF(header_column)
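The split logic itself is plain string handling; a minimal Python sketch with hypothetical file contents shows what the textFile/map pipeline above does per line:

```python
source_delimiter = "|_|"

# hypothetical file contents: first line carries the column names
lines = [
    "name|_|age",
    "Alice|_|1",
    "Bob|_|3",
]

# split the header into column names, then each data line into fields,
# mirroring headcal.take(1) and the map(lambda x: x.split(...)) step
header = lines[0].split(source_delimiter)
rows = [line.split(source_delimiter) for line in lines[1:]]
print(header)  # → ['name', 'age']
print(rows)    # → [['Alice', '1'], ['Bob', '3']]
```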