Convert DStream to DataFrame using PySpark

How can I convert a DStream to a DataFrame?
Here is my current code:
localhost = "127.0.0.1"
addresses = [(localhost, 9999)]
schema = ['event', 'id', 'time','occurence']
flumeStream = FlumeUtils.createPollingStream(ssc, addresses)
counts = flumeStream.map(lambda line: str(line).split(",")) \
.filter(lambda line: len(line)>1) \
.map(lambda line: (line[29],line[30],line[67],1)) \
.foreachRDD(lambda rdd: sqlContext.createDataFrame(rdd))
counts.show()
ssc.start()
ssc.awaitTerminationOrTimeout(62)
ssc.stop()
It gives me the following error:
AttributeError: 'NoneType' object has no attribute 'show'

Convert your DStream to RDDs and then to DataFrames. Note that foreachRDD() returns None, which is why counts.show() raises the AttributeError: build and show the DataFrame inside the foreachRDD callback instead of calling show() on its result.
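For example, a minimal sketch based on the code in the question (untested, and assuming ssc, sqlContext and the schema list are defined as above):
def process(rdd):
    # build and show a DataFrame for each non-empty micro-batch
    if not rdd.isEmpty():
        sqlContext.createDataFrame(rdd, schema).show()

counts = flumeStream.map(lambda line: str(line).split(",")) \
    .filter(lambda line: len(line) > 1) \
    .map(lambda line: (line[29], line[30], line[67], 1))

counts.foreachRDD(process)  # foreachRDD itself returns None, so don't assign and show() its result
ssc.start()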

Related

How do I stop a pyspark dataframe from changing to a list?

I start with a PySpark DataFrame and it gets converted to a list after I use .take() on it. How can I keep it a PySpark DataFrame?
df1 = Ce_clean
print(type(df1))
df1 = df1.take(1000)
print(type(df1))
<class 'pyspark.sql.dataframe.DataFrame'>
<class 'list'>
You can either convert the RDD/list back to a DataFrame, or use limit(n), which keeps it a DataFrame:
df2 = spark.createDataFrame(df1.take(100))
type(df2)
<class 'pyspark.sql.dataframe.DataFrame'>
or
df3 = df1.limit(100)
type(df3)
<class 'pyspark.sql.dataframe.DataFrame'>

SparkDataFrame.dtypes fails if a column has special chars: how to bypass this and read the CSV with inferSchema?

Inferring the schema of a Spark DataFrame throws an error if the CSV file has a column with special chars.
Test sample
foo.csv
id,comment
1, #Hi
2, Hello
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("footest").getOrCreate()
df = spark.read.load("foo.csv", format="csv", inferSchema="true", header="true")
print(df.dtypes)
This raises:
raise ValueError("Could not parse datatype: %s" % json_value)
I found a comment from Dat Tran on inferSchema in the spark-csv package about how to resolve this... can't we still infer the schema before data cleaning?
Use it like this:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Test').enableHiveSupport().getOrCreate()
df = spark.read.format("csv").option("inferSchema", "true").option("header", "true").load("test19.csv")
print(df.dtypes)
Output:
[('id', 'int'), ('comment', 'string')]

In Spark, how to write a header to a file if there are no rows in a DataFrame?

I want to write a header to a file even if there are no rows in the DataFrame. Currently, when I write an empty DataFrame to a file, the file is created but it does not have a header in it.
I am writing the DataFrame using these settings and this command:
Dataframe.repartition(1) \
.write \
.format("com.databricks.spark.csv") \
.option("ignoreLeadingWhiteSpace", False) \
.option("ignoreTrailingWhiteSpace", False) \
.option("header", "true") \
.save('/mnt/Bilal/Dataframe');
I want the header row in the file even if there are no data rows in the DataFrame.
If you want to have just a header file, you can use foldLeft to replace each column with an empty string and save that as your CSV. I have not used PySpark, but this is how it can be done in Scala; the majority of the code should be reusable, you will just have to work on converting it to PySpark (a rough PySpark sketch follows after the Scala code below).
import org.apache.spark.sql.functions.lit
val path = "/user/test"
// blank out every column so only the schema remains
val newdf = df.columns.foldLeft(df) { (tempdf, col) =>
  tempdf.withColumn(col, lit(""))
}
Create a method for writing the header file:
import java.io.PrintWriter
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

def createHeaderFile(headerFilePath: String, colNames: Array[String]): Unit = {
  // format the header file path
  val fileName = "yourfileName.csv"
  val headerFileFullName = "%s/%s".format(headerFilePath, fileName)
  val hadoopConfig = new Configuration()
  val fileSystem = FileSystem.get(hadoopConfig)
  val output = fileSystem.create(new Path(headerFileFullName))
  val writer = new PrintWriter(output)
  for (h <- colNames) {
    writer.write(h + ",")
  }
  writer.write("\n")
  writer.close()
}
Call it on your DataFrame:
createHeaderFile(path, newdf.columns)
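If you need the same thing directly in PySpark, here is a rough, untested sketch of the header-file writer; it reaches Hadoop's FileSystem API through PySpark's internal _jsc/_jvm gateway, so those attributes are implementation details rather than public API:
def create_header_file(spark, header_file_path, col_names, file_name="yourfileName.csv"):
    # sketch only: relies on PySpark internals to get at Hadoop's FileSystem
    sc = spark.sparkContext
    jvm = sc._jvm
    hadoop_conf = sc._jsc.hadoopConfiguration()
    fs = jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)
    out = fs.create(jvm.org.apache.hadoop.fs.Path("%s/%s" % (header_file_path, file_name)))
    try:
        # same trailing-comma format as the Scala version above
        out.write(bytearray(",".join(col_names) + ",\n", "utf-8"))
    finally:
        out.close()
# usage: create_header_file(spark, path, newdf.columns)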
I had the same problem as you, in PySpark. When the DataFrame was empty (e.g. after a .filter() transformation), the output was one empty CSV without a header.
So I created a custom method which checks whether the output is a single empty CSV; if it is, it adds just the header.
import glob
import os

def add_header_in_one_empty_csv(exported_path, columns):
    list_of_csv_files = glob.glob(os.path.join(exported_path, '*.csv'))
    if len(list_of_csv_files) == 1:
        csv_file = list_of_csv_files[0]
        # open for reading and appending so we can check whether the file is empty
        with open(csv_file, 'a+') as f:
            f.seek(0)
            if f.readline() == '':
                header = ','.join(columns)
                f.write(header)
Example:
# Create a dummy Dataframe
df = spark.createDataFrame([(1,2), (1, 4), (3, 2), (1, 4)], ("a", "b"))
# Filter in order to create an empty Dataframe
filtered_df = df.filter(df['a']>10)
# Write the df without rows and no header
filtered_df.write.csv('output.csv', header='true')
# Add the header
add_header_in_one_empty_csv('output.csv', filtered_df.columns)
The same problem occurred to me. What I did was use pandas to write empty DataFrames instead:
from os.path import join

if df.count() == 0:
    df.coalesce(1).toPandas().to_csv(join(output_folder, filename_output), index=False)
else:
    df.coalesce(1).write.format("csv").option("header", "true").mode('overwrite').save(join(output_folder, filename_output))

How to use createDataFrame to create a PySpark DataFrame?

I know this is probably a stupid question. I have the following code:
from pyspark.sql import SparkSession
rows = [1,2,3]
df = SparkSession.createDataFrame(rows)
df.printSchema()
df.show()
But I got an error:
createDataFrame() missing 1 required positional argument: 'data'
I don't understand why this happens because I already supplied 'data', which is the variable rows.
Thanks
You have to create a SparkSession instance using the builder pattern and use it for creating the DataFrame; see
https://spark.apache.org/docs/2.2.1/api/python/pyspark.sql.html#pyspark.sql.SparkSession
spark = SparkSession.builder.getOrCreate()
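For example, a corrected version of the snippet from the question might look like this (a sketch; the rows are wrapped in single-element tuples so Spark can infer a schema, and the column name "value" is just an illustration):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()   # build (or reuse) a SparkSession instance

rows = [(1,), (2,), (3,)]                    # each row is a tuple, not a bare int
df = spark.createDataFrame(rows, ["value"])  # call createDataFrame on the instance
df.printSchema()
df.show()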
Below are the steps to create a PySpark DataFrame using createDataFrame.
Create a SparkSession:
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
Create the data and columns:
columns = ["language","users_count"]
data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
First approach: creating the DataFrame from an RDD:
rdd = spark.sparkContext.parallelize(data)
df = spark.createDataFrame(rdd).toDF(*columns)
Second approach: creating the DataFrame directly:
df2 = spark.createDataFrame(data).toDF(*columns)
Try
row = [(1,), (2,), (3,)]
?
If I am not wrong, createDataFrame() takes two lists as input: the first list is the data and the second list is the column names. The data must be a list of tuples, where each tuple is a row of the DataFrame.
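A small illustration of that signature (the names here are made up for the example, and spark is the SparkSession built above):
data = [(1, "a"), (2, "b"), (3, "c")]   # first argument: the rows, one tuple per row
columns = ["id", "letter"]              # second argument: the column names
df = spark.createDataFrame(data, columns)
df.show()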

pyspark: Apply functions to every field of the RDD

I created a DataFrame using df1 = HiveContext(sc).sql("from xxx.table1 select * ") and converted it to an RDD with df1.rdd.
I have to apply transformations at the field level in each row. How do I do it?
I tried the below code:
df2 = rdd1.map(lambda row:
    Row(row.fld1,
        row.fld2.replace("'", "''").replace("\\", "\\\\").strip(),
        row.fld3.toLowerCase
    )
)
I get the error:
AttributeError: 'unicode' object has no attribute toLowerCase/replace
Could you help?
Replace
row.fld3.toLowerCase
by
row.fld3.lower()
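For example, the corrected mapping might look like this (a sketch that keeps the field names from the question):
from pyspark.sql import Row

df2 = rdd1.map(lambda row: Row(
    row.fld1,
    row.fld2.replace("'", "''").replace("\\", "\\\\").strip(),
    row.fld3.lower()   # Python strings use lower(), not toLowerCase
))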