pyspark: Apply functions on every fields of the RDD - pyspark

I created dataframe using df1 = HiveContext(sc).sql("from xxx.table1 select * ") Converted to RDD df1.rdd
I have to apply transformations at field level in a row. How do I do it?
I tried the below code:
df2 = rdd1.map(lambda row:
Row(row.fld1,
row.fld2.replace("'", "''").replace("\\","\\\\").strip(),
row.fld3.toLowerCase
)
)
I get error
AttributeError: 'unicode' object has no attribute toLowerCase/replace
Could you help?

Replace
row.fld3.toLowerCase
by
row.fld3.lower()

Related

How to convert a 'datetime' column

In my PySpark dataframe, I have a column 'TimeStamp' which is in DateTime format. I want to covert that to 'Date' format and then use that in the 'GroupBy'.
df = spark.sql("SELECT * FROM `myTable`")
df.filter((df.somthing!="thing"))
df.withColumn('MyDate', col('Timestamp').cast('date')
df.groupBy('MyDate').count().show()
But I get this error:
cannot resolve 'MyDate' given input columns:
Can you please help me with this ?
each time you do df. you are creating a new dataframe.
df was only initialized in your first line of code, so that dataframe object does not have the new column MyDate.
you can look at the id() of each object to see
df = spark.sql("SELECT * FROM `myTable`")
print(id(df))
print(id(df.filter(df.somthing!="thing")))
this is correct syntax to chain operations
df = spark.sql("SELECT * FROM myTable")
df = (df
.filter(df.somthing != "thing")
.withColumn('MyDate', col('Timestamp').cast('date'))
.groupBy('MyDate').count()
)
df.show(truncate=False)
UPDATE: this is a better way to write it
df = (
spark.sql(
"""
SELECT *
FROM myTable
""")
.filter(col("something") != "thing")
.withColumn("MyDate", col("Timestamp").cast("date"))
.groupBy("MyDate").count()
)

spark dataframe filter and select

I have a spark scala dataframe and need to filter the elements based on condition and select the count.
val filter = df.groupBy("user").count().alias("cnt")
val **count** = filter.filter(col("user") === ("subscriber").select("cnt")
The error i am facing is value select is not a member of org.apache.spark.sql.Column
Also for some reasons count is Dataset[Row]
Any thoughts to get the count in a single line?
DataSet[Row] is DataFrame
RDD[Row] is DataFrame so no need to worry.. its dataframe
see this for better understanding... Difference between DataFrame, Dataset, and RDD in Spark
Regarding select is not a member of org.apache.spark.sql.Column its purely compile error.
val filter = df.groupBy("user").count().alias("cnt")
val count = filter.filter (col("user") === ("subscriber"))
.select("cnt")
will work since you are missing ) braces which is closing brace for filter.
You are missing ")" before .select, Please check below code.
Column class don't have .select method, you have to invoke select on Dataframe.
val filter = df.groupBy("user").count().alias("cnt")
val **count** = filter.filter(col("user") === "subscriber").select("cnt")

how to use createDataFrame to create a pyspark dataframe?

I know this is probably to be a stupid question. I have the following code:
from pyspark.sql import SparkSession
rows = [1,2,3]
df = SparkSession.createDataFrame(rows)
df.printSchema()
df.show()
But I got an error:
createDataFrame() missing 1 required positional argument: 'data'
I don't understand why this happens because I already supplied 'data', which is the variable rows.
Thanks
You have to create SparkSession instance using the build pattern and use it for creating dataframe, check
https://spark.apache.org/docs/2.2.1/api/python/pyspark.sql.html#pyspark.sql.SparkSession
spark= SparkSession.builder.getOrCreate()
Below are the steps to create pyspark dataframe using createDataFrame
Create sparksession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
Create data and columns
columns = ["language","users_count"]
data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
Creating DataFrame from RDD
rdd = spark.sparkContext.parallelize(data)
df= spark.createDataFrame(rdd).toDF(*columns)
the second approach, Directly creating dataframe
df2 = spark.createDataFrame(data).toDF(*columns)
Try
row = [(1,), (2,), (3,)]
?
If I am not wrong createDataFrame() takes 2 lists as input: first list is the data and second list is the column names. The data must be a lists of list of tuples, where each tuple is a row of the dataframe.

null pointer exception while converting dataframe to list inside udf

I am reading 2 different .csv files which has only column as below:
val dF1 = sqlContext.read.csv("some.csv").select($"ID")
val dF2 = sqlContext.read.csv("other.csv").select($"PID")
trying to search if dF2("PID") exists in dF1("ID"):
val getIdUdf = udf((x:String)=>{dF1.collect().map(_(0)).toList.contains(x)})
val dfFinal = dF2.withColumn("hasId", getIdUdf($"PID"))
This gives me null pointer exception.
but if I convert dF1 outside and use list in udf it works:
val dF1 = sqlContext.read.csv("some.csv").select($"ID").collect().map(_(0)).toList
val getIdUdf = udf((x:String)=>{dF1.contains(x)})
val dfFinal = dF2.withColumn("hasId", getIdUdf($"PID"))
I know I can use join to get this done but want to know what is the reason of null pointer exception here.
Thanks.
Please check this question about accessing dataframe inside the transformation of another dataframe. This is exactly what you are doing with your UDF, and this is not possible in spark. Solution is either to use join, or collect outside of transformation and broadcast.

Filter out rows with NaN values for certain column

I have a dataset and in some of the rows an attribute value is NaN. This data is loaded into a dataframe and I would like to only use the rows which consist of rows where all attribute have values. I tried doing it via sql:
val df_data = sqlContext.sql("SELECT * FROM raw_data WHERE attribute1 != NaN")
I tried several variants on this, but I can't seem to get it working.
Another option would be to transform it to a RDD and then filter it, since filtering this dataframe to check if a attribute isNaN , does not work.
I know you accepted the other answer, but you can do it without the explode (which should perform better than doubling your DataFrame size).
Prior to Spark 1.6, you could use a udf like this:
def isNaNudf = udf[Boolean,Double](d => d.isNaN)
df.filter(isNaNudf($"value"))
As of Spark 1.6, you can now use the built-in SQL function isnan() like this:
df.filter(isnan($"value"))
Here is some sample code that shows you my way of doing it -
import sqlContext.implicits._
val df = sc.parallelize(Seq((1, 0.5), (2, Double.NaN))).toDF("id", "value")
val df2 = df.explode[Double, Boolean]("value", "isNaN")(d => Seq(d.isNaN))
df will have -
df.show
id value
1 0.5
2 NaN
while doing filter on df2 will give you what you want -
df2.filter($"isNaN" !== true).show
id value isNaN
1 0.5 false
This works:
where isNaN(tau_doc) = false
e.g.
val df_data = sqlContext.sql("SELECT * FROM raw_data where isNaN(attribute1) = false")