NullPointerException with pyspark dataframe - pyspark

I have a pyspark dataframe for which .show() indicates that everything is normal, but .toPandas(), .count(), and .write.parquet("abc/abc_pred.parquet") all result in a NullPointerException. I cannot do anything with this dataframe. Any ideas how I can export it?

For reference, here is how to create the DataFrame:
import findspark
findspark.init()
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
spark = SparkSession.builder.appName('test').getOrCreate()
data = [(1, "siva", 100), (2, "siva2", 200),(3, "siva3", 300),(4, "siva4", 400),(5, "siva5", 500)]
schema = ['id', 'name', 'sallary']
df = spark.createDataFrame(data, schema=schema)
df.show()
+---+-----+-------+
| id| name|sallary|
+---+-----+-------+
|  1| siva|    100|
|  2|siva2|    200|
|  3|siva3|    300|
|  4|siva4|    400|
|  5|siva5|    500|
+---+-----+-------+
df.agg({"sallary": "max"}).withColumnRenamed('max(sallary)', 'max').show()
+---+
|max|
+---+
|500|
+---+

Related

How to create a pyspark DataFrame inside of a loop?

How to create a pyspark DataFrame inside of a loop? In each iteration of the loop I print 2 values with print(a1,a2). Now I want to store all these values in a pyspark dataframe.
Initially, before the loop, you could create an empty dataframe with your preferred schema. Then, in each iteration, create a new df with the same schema and union it with your original dataframe. Refer to the code below.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField, StringType
spark = SparkSession.builder.getOrCreate()
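schema = StructType([
    StructField('a1', StringType(), True),
    StructField('a2', StringType(), True)
])
df = spark.createDataFrame([], schema)
for i in range(1, 5):
    # cast to str so the values match the StringType columns
    a1 = str(i)
    a2 = str(i + 1)
    newRow = spark.createDataFrame([(a1, a2)], schema)
    df = df.union(newRow)
# show() prints the table itself and returns None, so no print() is needed
df.show()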
This gives me the below result where the values are appended to the df in each loop.
+---+---+
| a1| a2|
+---+---+
|  1|  2|
|  2|  3|
|  3|  4|
|  4|  5|
+---+---+
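As a possible alternative to calling union inside the loop, the rows can be gathered in a plain Python list and turned into a DataFrame once at the end. This is a minimal sketch, assuming the same two string columns as above. The list name rows is made up for this illustration, and the Spark call is left as a comment so the snippet runs without a Spark installation:

```python
# Collect the per-iteration values first (plain Python, no Spark required).
rows = []
for i in range(1, 5):
    a1 = str(i)        # strings, to match the StringType columns of the schema
    a2 = str(i + 1)
    rows.append((a1, a2))

print(rows)

# Assumption: with a live SparkSession named `spark` and the `schema`
# defined in the answer above, one call would then build the DataFrame:
#   df = spark.createDataFrame(rows, schema)
```

Building the DataFrame once avoids adding one more union step to the query plan in every iteration, which can matter when the loop runs many times.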

How to split a string column in a DataFrame into multiple columns in Spark Structured Streaming

This is the current code:
from pyspark.sql import SparkSession
spark_session = SparkSession\
    .builder\
    .appName("test")\
    .getOrCreate()
lines = spark_session\
    .readStream\
    .format("socket")\
    .option("host", "127.0.0.1")\
    .option("port", 9998)\
    .load()
The 'lines' looks like this:
+-------------+
| value |
+-------------+
| a,b,c |
+-------------+
But I want to look like this:
+---+---+---+
| a | b | c |
+---+---+---+
I tried using the split() function, but it didn't do what I wanted: it only splits each string into an array within a single column, not into multiple columns.
What should I do?
Split the value column, then create the new columns by accessing the array by index, with getItem(), or with element_at() (available from Spark 2.4).
from pyspark.sql.functions import *
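# element_at() is 1-based, so 3 picks the third item.
# Note: show() only works on a batch DataFrame; for a streaming source,
# write the result out with writeStream (e.g. the console sink) instead.
lines.withColumn("tmp", split(col("value"), ',')).\
    withColumn("col1", col("tmp")[0]).\
    withColumn("col2", col("tmp").getItem(1)).\
    withColumn("col3", element_at(col("tmp"), 3)).\
    drop("tmp", "value").\
    show()
#+----+----+----+
#|col1|col2|col3|
#+----+----+----+
#|   a|   b|   c|
#+----+----+----+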
from pyspark.sql.functions import *
import pyspark.sql.functions as f
from pyspark.sql import SparkSession
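spark_session = SparkSession\
    .builder\
    .appName("test")\
    .getOrCreate()
lines = spark_session\
    .readStream\
    .format("socket")\
    .option("host", "127.0.0.1")\
    .option("port", 9998)\
    .load()
split_col = f.split(lines['value'], ",")
# build the new columns on top of the streaming DataFrame `lines`
df = lines.withColumn('col1', split_col.getItem(0))
df = df.withColumn('col2', split_col.getItem(1))
df = df.withColumn('col3', split_col.getItem(2))
# show() is not supported on a streaming DataFrame, so print to the console sink
query = df.writeStream.format("console").start()
query.awaitTermination()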
In case the rows have different numbers of delimiters, and not always exactly 3 values per row, you can use the following:
Input:
+-------+
|value |
+-------+
|a,b,c |
|d,e,f,g|
+-------+
Solution
import pyspark.sql.functions as F
max_size = df.select(F.max(F.length(F.regexp_replace('value','[^,]','')))).first()[0]
out = df.select([F.split("value",',')[x].alias(f"Col{x+1}") for x in range(max_size+1)])
Output
out.show()
+----+----+----+----+
|Col1|Col2|Col3|Col4|
+----+----+----+----+
|   a|   b|   c|null|
|   d|   e|   f|   g|
+----+----+----+----+
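The logic of that solution can be traced in plain Python, without a Spark session. This is only a sketch of the idea (count the commas, take the maximum, pad short rows with None), not the Spark API itself. The names values, max_delims, n_cols, table and header are made up for the illustration:

```python
# Same two input strings as in the example above.
values = ["a,b,c", "d,e,f,g"]

# Largest number of commas in any row (the role of max_size in the Spark code).
max_delims = max(v.count(",") for v in values)
n_cols = max_delims + 1

# Split each row and pad with None, which plays the role of Spark's null.
table = []
for v in values:
    parts = v.split(",")
    table.append(parts + [None] * (n_cols - len(parts)))

header = [f"Col{i + 1}" for i in range(n_cols)]
print(header)
for row in table:
    print(row)
```

The number of columns is one more than the largest comma count, which is why the Spark version iterates over range(max_size+1).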

Show all the matched strings in a pyspark dataframe

I want to show all the rows whose string column matches a given value.
Code:
# Many Stack Overflow questions and answers leave out the imports and setup code needed before using pyspark, which forces readers to research them instead of getting a direct answer, so I share the code from the beginning for future readers.
# Import libraries
from pyspark.sql import SparkSession
from pyspark import SparkContext
import pandas as pd
import numpy as np
# Initiate the session
spark = SparkSession\
    .builder\
    .appName('Operations')\
    .getOrCreate()
# sc = SparkContext()
sc = SparkContext.getOrCreate()
# Create dataframe 1
sdataframe_temp = spark.createDataFrame([
    (1, 2, '3'),
    (2, 2, 'yes')],
    ['a', 'b', 'c']
)
# Create Dataframe 2
sdataframe_temp2 = spark.createDataFrame([
    (4, 6, 'yes'),
    (5, 7, 'yes')],
    ['a', 'b', 'c']
)
# Combine dataframe
sdataframe_union_1_2 = sdataframe_temp.union(sdataframe_temp2)
# Filter out the columns based on respective rules
# I wish to stick with dataframe method if possible.
sdataframe_temp\
    .filter(sdataframe_union_1_2['c'] == 'yes')\
    .select(['a', 'b'])\
    .show()
Output:
+---+---+
|  a|  b|
+---+---+
|  2|  2|
+---+---+
Expected output:
+---+---+
|  a|  b|
+---+---+
|  2|  2|
|  4|  6|
|  5|  7|
+---+---+
Can anyone please give some suggestions or improvement?
Here's a way using unionByName:
df = (sdataframe_temp
      .unionByName(sdataframe_temp2)
      .where("c == 'yes'")
      .drop('c'))
df.show()
+---+---+
|  a|  b|
+---+---+
|  2|  2|
|  4|  6|
|  5|  7|
+---+---+
You should change the last statement of your code. To use the col function, import it from pyspark.sql.functions:
from pyspark.sql.functions import *
# I wish to stick with dataframe method if possible.
sdataframe_union_1_2\
    .filter(col('c') == 'yes')\
    .select(['a', 'b'])\
    .show()
or
# I wish to stick with dataframe method if possible.
sdataframe_union_1_2\
    .filter(sdataframe_union_1_2['c'] == 'yes')\
    .select(['a', 'b'])\
    .show()
You have to select the data from sdataframe_union_1_2, but you are selecting from sdataframe_temp. That is why you are getting only one record.
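To see why the source of the filter matters, here is a small plain-Python sketch that mimics the two frames with lists of tuples. No Spark is needed, and the variable names are invented for the illustration. Filtering only the first list finds one row, while filtering the combined list finds all three:

```python
# Rows are (a, b, c) tuples, mirroring the two DataFrames in the question.
temp = [(1, 2, "3"), (2, 2, "yes")]
temp2 = [(4, 6, "yes"), (5, 7, "yes")]
combined = temp + temp2  # plays the role of union()

# Filtering only the first frame, as in the question.
from_first = [(a, b) for a, b, c in temp if c == "yes"]

# Filtering the combined frame, as in the answers.
from_union = [(a, b) for a, b, c in combined if c == "yes"]

print(from_first)
print(from_union)
```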

How to create a simple DataFrame with random values

I am trying to create a very simple DataFrame with for example 3 columns and 3 rows.
I would like to have something like this:
+------+---+-----+
|nameID|age| Code|
+------+---+-----+
|2123 | 80| 4553|
|65435 | 10| 5454|
+------+---+-----+
How can I create that DataFrame in Scala? (The values above are just an example.)
I have the following program:
import org.apache.spark.sql._
import org.apache.spark.sql.types._
object ejemploApp extends App{
  val schema = StructType(List(
    StructField("name", LongType, true),
    StructField("pandas", LongType, true),
    StructField("id", LongType, true)))
}
val outputDF = sqlContext.createDataFrame(sc.emptyRDD, schema)
First problem:
The outputDF line fails with the error: cannot resolve symbol schema.
Second problem:
How can I add random numbers to the DataFrame?
I would do something like this:
val nRows = 10
import scala.util.Random
val df = (1 to nRows)
  .map(_ => (Random.nextLong,Random.nextLong,Random.nextLong))
  .toDF("nameID","age","Code")
+--------------------+--------------------+--------------------+
|              nameID|                 age|                Code|
+--------------------+--------------------+--------------------+
| 5805854653225159387|-1935762756694500432| 1365584391661863428|
| 4308593891267308529|-1117998169834014611|  366909655761037357|
|-6520321841013405169| 7356990033384276746| 8550003986994046206|
| 6170542655098268888| 1233932617279686622| 7981198094124185898|
|-1561157245868690538| 1971758588103543208| 6200768383342183492|
|-8160793384374349276|-6034724682920319632| 6217989507468659178|
| 4650572689743320451| 4798386671229558363|-4267909744532591495|
| 1769492191639599804| 7162442036876679637|-4756245365203453621|
| 6677455911726550485| 8804868511911711123|-1154102864413343257|
|-2910665375162165247|-7992219570728643493|-3903787317589941578|
+--------------------+--------------------+--------------------+
Of course, the age isn't very realistic, but you can adjust the random numbers as you wish (e.g. using Scala's modulo operator and absolute value). You could, for example, do:
val df = (1 to nRows)
  .map(id => (id.toLong,Math.abs(Random.nextLong % 100L),Random.nextLong))
  .toDF("nameID","age","Code")
+------+---+--------------------+
|nameID|age|                Code|
+------+---+--------------------+
|     1| 17| 7143235115334699492|
|     2| 83|-3863778506510275412|
|     3| 31|-3839786144396379186|
|     4| 40| 8057989112338559775|
|     5| 67| 7601061291211506255|
|     6| 71| 7393782421106239325|
|     7| 43|   28349510524075085|
|     8| 50|  539042255545625624|
|     9| 41|-8654000375112432924|
|    10| 82|-1300111870445007499|
+------+---+--------------------+
EDIT: make sure you have the implicits imported:
Spark 1.6:
import sqlContext.implicits._
Spark 2:
import sparkSession.implicits._

How to convert a dataframe column to sequence

I have a dataframe as below:
+-----+--------------------+
|LABEL| TERM|
+-----+--------------------+
| 4| inhibitori_effect|
| 4| novel_therapeut|
| 4| antiinflammator...|
| 4| promis_approach|
| 4| cell_function|
| 4| cell_line|
| 4| cancer_cell|
I want to create a new dataframe by collecting all terms into a sequence so that I can use them with Word2vec. That is:
+-----+--------------------+
|LABEL| TERM|
+-----+--------------------+
| 4| inhibitori_effect, novel_therapeut,..., cell_line |
My goal is to apply the sample code given here: https://spark.apache.org/docs/latest/ml-features.html#word2vec
So far I have tried to convert the df to an RDD and map it, but then I could not manage to convert it back to a df.
Thanks in advance.
EDIT:
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SQLContext
val sc = new SparkContext(conf)
val sqlContext: SQLContext = new HiveContext(sc)
val df = sqlContext.load("jdbc",Map(
  "url" -> "jdbc:oracle:thin:...",
  "dbtable" -> "table"))
df.show(20)
df.groupBy($"label").agg(collect_list($"term").alias("term"))
You can use collect_list or collect_set functions:
import org.apache.spark.sql.functions.{collect_list, collect_set}
df.groupBy($"label").agg(collect_list($"term").alias("term"))
In Spark < 2.0 this requires HiveContext. In Spark 2.0+ collect_list and collect_set have a native implementation, so Hive support should no longer be needed. See Use collect_list and collect_set in Spark SQL
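For intuition about what groupBy plus collect_list produces, here is a plain-Python sketch (not Spark code) using a few rows taken from the table in the question. The names rows and grouped are made up for the illustration:

```python
from collections import defaultdict

# (LABEL, TERM) pairs from the example table.
rows = [
    (4, "inhibitori_effect"),
    (4, "novel_therapeut"),
    (4, "promis_approach"),
    (4, "cell_function"),
]

# Gather the terms of each label into one list, keeping duplicates
# (collect_set would drop them instead).
grouped = defaultdict(list)
for label, term in rows:
    grouped[label].append(term)

print(dict(grouped))
```

In this sketch the terms keep their input order. In Spark, the order of the collected elements is not guaranteed after a shuffle, so it should not be relied on.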