From some brief testing, it appears that the column drop function for PySpark dataframes is not case sensitive, e.g.:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
import sys
sparkSession = SparkSession.builder.appName("my-session").getOrCreate()
dff = sparkSession.createDataFrame([(10,123), (14,456), (16,678)], ["age", "AGE"])
>>> dff.show()
+---+---+
|age|AGE|
+---+---+
| 10|123|
| 14|456|
| 16|678|
+---+---+
>>> dff.drop("AGE")
DataFrame[]
>>> dff_dropped = dff.drop("AGE")
>>> dff_dropped.show()
++
||
++
||
||
||
++
"""
What I'd like to see here is:
+---+
|age|
+---+
| 10|
| 14|
| 16|
+---+
"""
Is there a way to drop dataframe columns in a case-sensitive way? (I have seen some comments related to this in Spark JIRA discussions, but I was looking for something that applies only to the drop() operation in an ad hoc way, not a global / persistent setting.)
# Add this before using drop
sqlContext.sql("set spark.sql.caseSensitive=true")
You need to set case sensitivity to true if you have two columns with the same name.
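For an ad hoc, drop-only use, a minimal sketch (assuming the sparkSession and dff from the question) is to enable the setting just around the call and then restore it:
# enable case-sensitive analysis only for this operation
sparkSession.conf.set("spark.sql.caseSensitive", "true")
dff_dropped = dff.drop("AGE")   # now drops only the upper-case column
dff_dropped.show()              # the lower-case "age" column remains
# restore the default (case-insensitive) behaviour
sparkSession.conf.set("spark.sql.caseSensitive", "false")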
I'm new to Pyspark and am trying to transform data.
Given the dataframe:
Col1
A=id1a A=id2a B=id1b C=id1c B=id2b
D=id1d A=id3a B=id3b C=id2c
A=id4a C=id3c
Required:
A B C
id1a id1b id1c
id2a id2b id2c
id3a id3b id3c
id4a null null
I have tried pivot, but that gives only the first value.
There might be a better way, but one approach is to split the column on whitespace to create an array of entries, then use higher-order functions (Spark 2.4+) to split each entry of that array on '='. Then explode and create two columns, one with the key and one with the value. Finally, assign a row number within each key partition, group by that row number and pivot:
import pyspark.sql.functions as F
from pyspark.sql import Window
# split on whitespace, then split each entry on '=' and explode into key/value rows
df1 = (df.withColumn("Col1", F.split(F.col("Col1"), r"\s+"))
         .withColumn("Col1", F.explode(F.expr("transform(Col1, x -> split(x, '='))")))
         .select(F.col("Col1")[0].alias("cols"), F.col("Col1")[1].alias("vals")))
# number the rows within each key, then pivot the keys into columns
w = Window.partitionBy("cols").orderBy("cols")
final = (df1.withColumn("Rnum", F.row_number().over(w))
            .groupBy("Rnum")
            .pivot("cols")
            .agg(F.first("vals"))
            .orderBy("Rnum"))
final.show()
+----+----+----+----+----+
|Rnum| A| B| C| D|
+----+----+----+----+----+
| 1|id1a|id1b|id1c|id1d|
| 2|id2a|id2b|id2c|null|
| 3|id3a|id3b|id3c|null|
| 4|id4a|null|null|null|
+----+----+----+----+----+
This is how df1 looks after the transformation:
df1.show()
+----+----+
|cols|vals|
+----+----+
| A|id1a|
| A|id2a|
| B|id1b|
| C|id1c|
| B|id2b|
| D|id1d|
| A|id3a|
| B|id3b|
| C|id2c|
| A|id4a|
| C|id3c|
+----+----+
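To get closer to the requested A/B/C layout, the helper column can be dropped at the end; a small follow-up using the final DataFrame from above:
final.drop("Rnum").show()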
Maybe I don't know the full picture, but the data format seems strange. If nothing can be done at the data source, then some collects, pivots and joins will be needed. Try this.
import pyspark.sql.functions as F
from functools import reduce
test = sqlContext.createDataFrame([('A=id1a A=id2a B=id1b C=id1c B=id2b',1),('D=id1d A=id3a B=id3b C=id2c',2),('A=id4a C=id3c',3)],schema=['col1','id'])
# split each line on spaces and explode into one entry per row
tst_spl = test.withColumn("item", F.split('col1', " "))
tst_xpl = tst_spl.select(F.explode("item"))
# split each entry on '=' into a key and a value
tst_map = tst_xpl.withColumn("key", F.split('col', '=')[0]).withColumn("value", F.split('col', '=')[1]).drop('col')
#%%
# pivot the keys into columns, collecting all values per key into a list
tst_pivot = tst_map.groupby(F.lit(1)).pivot('key').agg(F.collect_list('value')).drop('1')
#%%
# explode each list with its position and join the per-key frames back together on position
tst_arr = [tst_pivot.select(F.posexplode(coln)).withColumnRenamed('col', coln) for coln in tst_pivot.columns]
tst_fin = reduce(lambda df1, df2: df1.join(df2, on='pos', how='full'), tst_arr).orderBy('pos')
tst_fin.show()
+---+----+----+----+----+
|pos| A| B| C| D|
+---+----+----+----+----+
| 0|id3a|id3b|id1c|id1d|
| 1|id4a|id1b|id2c|null|
| 2|id1a|id2b|id3c|null|
| 3|id2a|null|null|null|
+---+----+----+----+----+
Is there a way of counting approximately after a group by on an SQL dataset in Spark? Or, more generally, what is the fastest way of doing a group-by count in Spark?
I am not sure these are what you are looking for...
approx_count_distinct and countDistinct are what the Spark API provides; there is no approx_count_groupby.
Examples:
package examples
import org.apache.log4j.Level
import org.apache.spark.sql.SparkSession
object CountAgg extends App {
  val logger = org.apache.log4j.Logger.getLogger("org")
  logger.setLevel(Level.WARN)
  val spark = SparkSession.builder.appName(getClass.getName)
    .master("local[*]").getOrCreate
  import spark.implicits._
  import org.apache.spark.sql.functions._
  val df =
    Seq(("PAGE1","VISITOR1"),
      ("PAGE1","VISITOR1"),
      ("PAGE2","VISITOR1"),
      ("PAGE2","VISITOR2"),
      ("PAGE2","VISITOR1"),
      ("PAGE1","VISITOR1"),
      ("PAGE1","VISITOR2"),
      ("PAGE1","VISITOR1"),
      ("PAGE1","VISITOR2"),
      ("PAGE1","VISITOR1"),
      ("PAGE2","VISITOR2"),
      ("PAGE1","VISITOR3")
    ).toDF("Page", "Visitor")
  println("group by and count example")
  df.groupBy($"page").agg(count($"visitor").as("count")).show
  println("group by and countDistinct")
  df.select("page","visitor")
    .groupBy('page)
    .agg(countDistinct('visitor)).show
  println("group by and approx_count_distinct")
  df.select("page","visitor")
    .groupBy('page)
    .agg(approx_count_distinct('visitor)).show
}
Result
+-----+-----+
| page|count|
+-----+-----+
|PAGE2| 4|
|PAGE1| 8|
+-----+-----+
group by and countDistinct
+-----+-----------------------+
| page|count(DISTINCT visitor)|
+-----+-----------------------+
|PAGE2| 2|
|PAGE1| 3|
+-----+-----------------------+
group by and approx_count_distinct
+-----+------------------------------+
| page|approx_count_distinct(visitor)|
+-----+------------------------------+
|PAGE2| 2|
|PAGE1| 3|
+-----+------------------------------+
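If speed is the main concern, approx_count_distinct also takes an optional rsd argument (maximum relative standard deviation, default 0.05), so you can trade accuracy for speed; a small sketch, assuming the same df as above:
df.select("page","visitor")
  .groupBy($"page")
  .agg(approx_count_distinct($"visitor", 0.1).as("approx_visitors")).show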
I'm trying in vain to use the PySpark substring function inside a UDF. Below is my code snippet -
from pyspark.sql.functions import substring, udf
def my_udf(my_str):
    try:
        my_sub_str = substring(my_str, 1, 2)
    except Exception:
        pass
    else:
        return my_sub_str
apply_my_udf = udf(my_udf)
df = input_data.withColumn("sub_str", apply_my_udf(input_data.col0))
The sample data is-
ABC1234
DEF2345
GHI3456
But when I print the df, I don't get any value in the new column "sub_str" as shown below -
[Row(col0='ABC1234', sub_str=None), Row(col0='DEF2345', sub_str=None), Row(col0='GHI3456', sub_str=None)]
Can anyone please let me know what I'm doing wrong?
You don't need a udf to use substring; in fact it won't work inside one, because pyspark.sql.functions.substring builds a Column expression rather than operating on plain Python strings. Here's a cleaner and faster way:
>>> from pyspark.sql import functions as f
>>> df.show()
+-------+
| data|
+-------+
|ABC1234|
|DEF2345|
|GHI3456|
+-------+
>>> df.withColumn("sub_str", f.substring("data", 1, 2)).show()
+-------+-------+
| data|sub_str|
+-------+-------+
|ABC1234| AB|
|DEF2345| DE|
|GHI3456| GH|
+-------+-------+
If you need to use a udf for that, you could also try something like:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
input_data = spark.createDataFrame([
    (1, "ABC1234"),
    (2, "DEF2345"),
    (3, "GHI3456")
], ("id", "col0"))
udf1 = udf(lambda x: x[0:2], StringType())
input_data.withColumn('sub_str', udf1('col0')).show()
+---+-------+-------+
| id| col0|sub_str|
+---+-------+-------+
| 1|ABC1234| AB|
| 2|DEF2345| DE|
| 3|GHI3456| GH|
+---+-------+-------+
However, as Mohamed Ali JAMAOUI wrote, you could easily do this without a udf here.
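For completeness, Column.substr gives the same result without importing substring; a small sketch, assuming the same df with a data column as in the answer above:
>>> from pyspark.sql import functions as f
>>> df.withColumn("sub_str", f.col("data").substr(1, 2)).show()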
I find it hard to understand the difference between these two functions from pyspark.sql.functions, as the documentation on the official PySpark website is not very informative. For example, the following code:
import pyspark.sql.functions as F
print(F.col('col_name'))
print(F.lit('col_name'))
The results are:
Column<b'col_name'>
Column<b'col_name'>
So what is the difference between the two, and when should I use one rather than the other?
The doc says:
col:
Returns a Column based on the given column name.
lit:
Creates a Column of literal value.
Say we have a data frame as below:
>>> import pyspark.sql.functions as F
>>> from pyspark.sql.types import *
>>> schema = StructType([StructField('A', StringType(), True)])
>>> df = spark.createDataFrame([("a",), ("b",), ("c",)], schema)
>>> df.show()
+---+
| A|
+---+
| a|
| b|
| c|
+---+
If using col to create a new column from A:
>>> df.withColumn("new", F.col("A")).show()
+---+---+
| A|new|
+---+---+
| a| a|
| b| b|
| c| c|
+---+---+
So col grabs an existing column with the given name; F.col("A") is equivalent to df.A or df["A"] here.
If using F.lit("A") to create the column:
>>> df.withColumn("new", F.lit("A")).show()
+---+---+
| A|new|
+---+---+
| a| A|
| b| A|
| c| A|
+---+---+
lit, on the other hand, creates a constant column with the given string as its value.
Both of them return a Column object but the content and meaning are different.
To explain in a very succinct manner, col is typically used to refer to an existing column in a DataFrame, as opposed to lit, which is typically used to set the value of a column to a literal.
To illustrate with an example:
Assume I have a DataFrame df containing two columns of IntegerType, col_a and col_b.
If I wanted a column total that is the sum of the two columns:
df.withColumn('total', col('col_a') + col('col_b'))
If instead I wanted a column fixed_val having the value "Hello" for all rows of the DataFrame df:
df.withColumn('fixed_val', lit('Hello'))
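A minimal runnable sketch pulling the two cases together (the df, col_a and col_b here are made-up example data, and an active SparkSession named spark is assumed):
from pyspark.sql.functions import col, lit
df = spark.createDataFrame([(1, 2), (3, 4)], ("col_a", "col_b"))
df.withColumn('total', col('col_a') + col('col_b')) \
  .withColumn('fixed_val', lit('Hello')) \
  .show()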
I am learning to work with Apache Spark (Scala) and am still figuring out how things work here.
I am trying to achieve the simple task of:
Finding the max of a column
Subtracting each value of the column from this max and creating a new column
The code I am using is
import org.apache.spark.sql.functions._
val training = sqlContext.createDataFrame(Seq(
  (10),
  (13),
  (14),
  (21)
)).toDF("Values")
val training_max = training.withColumn("Val_Max",training.groupBy().agg(max("Values"))
val training_max_sub = training_max.withColumn("Subs",training_max.groupBy().agg(col("Val_Max")-col("Values) ))
However, I am getting a lot of errors. I am more or less fluent in R, and had I been doing the same task my code would have been:
library(dplyr)
new_data <- training %>%
mutate(Subs= max(Values) - Values)
Here is a solution using window functions. You'll need a HiveContext to use them:
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._
val training = sc.parallelize(Seq(10,13,14,21)).toDF("values")
training.withColumn("subs",
max($"values").over(Window.partitionBy()) - $"values").show
Which produces the expected output:
+------+----+
|values|subs|
+------+----+
| 10| 11|
| 13| 8|
| 14| 7|
| 21| 0|
+------+----+
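An alternative sketch that avoids window functions: compute the max once with an aggregation, then subtract each value from it via lit (this assumes the same training DataFrame and imports as above; on Spark 2.x+ either version works with a plain SparkSession):
val maxVal = training.agg(max($"values")).first().getInt(0)
training.withColumn("subs", lit(maxVal) - $"values").show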