Confusion on types of Spark RDDs - pyspark

I am just learning Spark; I started with RDDs and am now moving on to DataFrames. In my current pyspark project, I am reading an S3 file into an RDD and running some simple transformations on it. Here is the code:
segmentsRDD = sc.textFile(fileLocation). \
    filter(lambda line: line.split(",")[6] in INCLUDE_SITES). \
    filter(lambda line: line.split(",")[2] not in EXCLUDE_MARKETS). \
    filter(lambda line: "null" not in line). \
    map(splitComma). \
    filter(lambda line: line.split(",")[5] == '1')
splitComma is a function that does some date calculations on the row data and returns 10 comma-delimited fields. Once I have that, I run the last filter shown above to pick up only the rows where the value in field [5] is '1'. So far everything is fine.
Next, I would like to convert segmentsRDD to a DataFrame with the schema shown below.
interim_segmentsDF = segmentsRDD.map(lambda x: x.split(",")).toDF("itemid","market","itemkey","start_offset","end_offset","time_shifted","day_shifted","tmsmarketid","caption","itemstarttime")
But I get an error about being unable to convert a "pyspark.rdd.PipelinedRDD" to a DataFrame. Can you please explain the difference between a "pyspark.rdd.PipelinedRDD" and a "row RDD"? I am attempting to convert to a DataFrame with the schema shown. What am I missing here?
Thanks

You have to add the following lines to your code:
from pyspark.sql import SparkSession
spark = SparkSession(sc)
The method .toDF() is not an original method of the RDD API.
If you take a look at the Spark source code you will see that .toDF() is a monkey patch.
So, once the SparkSession is initialized you can call this monkey-patched method; in other words, when you run rdd.toDF() you are directly calling the .toDF() method from the DataFrame API.
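Putting it together, a minimal sketch of the conversion (note that in PySpark, .toDF() takes the column names as a single list rather than as separate string arguments):
from pyspark.sql import SparkSession
spark = SparkSession(sc)  # sc is the existing SparkContext; this also patches .toDF() onto RDDs
columns = ["itemid", "market", "itemkey", "start_offset", "end_offset",
           "time_shifted", "day_shifted", "tmsmarketid", "caption", "itemstarttime"]
# each element of segmentsRDD is a comma-delimited string, so split it into fields first
interim_segmentsDF = segmentsRDD.map(lambda x: x.split(",")).toDF(columns)
interim_segmentsDF.printSchema()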

Related

How to use structured spark streaming in python to insert row into Mongodb using ForeachWriter?

I'm using Spark streaming to read data from Kafka and insert it into MongoDB. I'm using pyspark 2.4.4. I'm trying to make use of ForeachWriter, because just using the foreach method means the connection is established for every row.
from pymongo import MongoClient

class ForeachWriter:
    def open(self, partition_id, epoch_id):
        # Open connection. This method is optional in Python.
        self.connection = MongoClient("192.168.0.100:27017")
        self.db = self.connection['test']
        self.coll = self.db['output']
        print(epoch_id)
        pass

    def process(self, row):
        # Write row to connection. This method is NOT optional in Python.
        # self.coll = None  -> used this to test whether I'd get an exception if the collection is None, but I don't
        self.coll.insert_one(row.asDict())
        pass

    def close(self, error):
        # Close the connection. This method is optional in Python.
        print(error)
        pass
df_w = df7 \
    .writeStream \
    .foreach(ForeachWriter()) \
    .trigger(processingTime='1 seconds') \
    .outputMode("update") \
    .option("truncate", "false") \
    .start()
df_w = df7 \
    .writeStream \
    .foreach(ForeachWriter()) \
    .trigger(processingTime='1 seconds') \
    .outputMode("update") \
    .option("truncate", "false") \
    .start()
My problem is that it's not inserting into MongoDB and I can't find a solution for this. If I comment it out I get an error, but the process method is not executing. Does anyone have any ideas?
You set the collection to None in the first line of the process function, so you are inserting the row into nowhere.
Also, I don't know if it is just here or in your actual code as well, but you have the writeStream part twice.
This is probably not documented in the Spark docs, but if you look at the definition of foreach in pyspark, it contains the following lines of code:
# Check if the data should be processed
should_process = True
if open_exists:
    should_process = f.open(partition_id, epoch_id)
Therefore, whenever we open a new connection, open() must return True. The example in the actual documentation uses 'pass', which results in process() never getting called. (This answer is for future reference for anybody facing the same issue.)
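Applying that to the writer from the question, a minimal corrected sketch (connection details copied from the question; the return True in open() is the key change):
from pymongo import MongoClient

class ForeachWriter:
    def open(self, partition_id, epoch_id):
        # Open the connection once per partition/epoch.
        self.connection = MongoClient("192.168.0.100:27017")
        self.db = self.connection['test']
        self.coll = self.db['output']
        return True  # tells Spark to call process() for the rows of this partition

    def process(self, row):
        self.coll.insert_one(row.asDict())

    def close(self, error):
        self.connection.close()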

Apache Spark read multiple text files in single run

I can successfully load a text file into a DataFrame with the following Apache Spark Scala code:
val df = spark.read.text("first.txt")
.withColumn("fileName", input_file_name())
.withColumn("unique_id", monotonically_increasing_id())
Is there any way to provide multiple files in a single run? Something like this:
val df = spark.read.text("first.txt,second.txt,someother.txt")
.withColumn("fileName", input_file_name())
.withColumn("unique_id", monotonically_increasing_id())
Right now the code above fails with the following error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Path does not exist: file:first.txt,second.txt,someother.txt;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:558)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:545)
How to properly load multiple text files?
The function spark.read.text() has a varargs parameter; from the docs:
def text(paths: String*): DataFrame
This means that to read multiple files you only need to pass them to the function as separate arguments, i.e.
val df = spark.read.text("first.txt", "second.txt", "someother.txt")

SparkSQL Dataframe Error: value show is not a member of org.apache.spark.sql.DataFrameReader

I'm new to Spark/Scala/DataFrames. I'm using Scala 2.10.5 and Spark 1.6.0. I am trying to load a CSV file and then create a DataFrame from it. Using the Scala shell, I execute the following in the order below. Once I execute the sixth statement (df.show()), I get an error that says:
error: value show is not a member of org.apache.spark.sql.DataFrameReader
Could someone advise on what I might be missing? I understand I don't need to import the SparkContext when using the REPL (shell), since sc is created automatically, but any ideas what I'm doing wrong?
import org.apache.spark.sql.SQLContext
import sqlContext.implicits._
val sqlContext = new SQLContext(sc)
val csvfile = "path_to_filename in hdfs...."
val df = sqlContext.read.format(csvfile).option("header", "true").option("inferSchema", "true")
df.show()
Try this:
val df = sqlContext.read.option("header", "true").option("inferSchema", "true").csv(csvfile)
sqlContext.read gives you a DataFrameReader, and option and format both set options and give you back a DataFrameReader. You need to call one of the methods that returns a DataFrame (such as csv or load) before you can call things like show on it.
See https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameReader for more info.

spark scala issue uploading csv

I am trying to load a CSV file into a temp table so that I can query it, and I am having two issues.
First: I tried loading the CSV into a DataFrame, and this CSV has some empty fields... and I didn't find a way to do it. I found someone in another post suggesting to use:
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("cars.csv")
but it gives me an error saying "Failed to load class for data source: com.databricks.spark.csv"
Then I uploaded the file and read it as a text file, without the headers, as follows:
val sqlContext = new org.apache.spark.sql.SQLContext(sc);
import sqlContext.implicits._;
case class cars(id: Int, name: String, licence: String);
val carsDF = sc.textFile("../myTests/cars.csv").map(_.split(",")).map(p => cars( p(0).trim.toInt, p(1).trim, p(2).trim) ).toDF();
carsDF.registerTempTable("cars");
val dgp = sqlContext.sql("SELECT * FROM cars");
dgp.show()
This gives an error because one of the licence fields is empty... I tried to handle this issue when I build the DataFrame, but it did not work.
I can obviously go into the CSV file and fix it by adding a null, but I do not want to do this because there are a lot of fields and it could be problematic. I want to fix it programmatically, either when I create the DataFrame or in the class...
If you have any other thoughts, please let me know as well.
To be able to use spark-csv you have to make sure it is available. In interactive mode the simplest solution is to use the --packages argument when you start the shell:
bin/spark-shell --packages com.databricks:spark-csv_2.10:1.1.0
Regarding manual parsing: working with CSV files, especially malformed ones like cars.csv, requires much more work than simply splitting on commas. Some things to consider:
how to detect csv dialect, including method of string quoting
how to handle quotes and new line characters inside strings
how to handle malformed lines
In the case of the example file you have to at least (a rough sketch follows after this list):
filter empty lines
read header
map lines to fields, providing a default value if a field is missing
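Here is a rough sketch of that manual approach (shown in PySpark for illustration, since the thread mixes PySpark and Scala; the file path and three-column layout are taken from the question, and using None as the default value is an assumption):
raw = sc.textFile("../myTests/cars.csv")
header = raw.first()
def parseLine(line):
    fields = [f.strip() for f in line.split(",")]
    # pad so that a missing trailing field gets a default value instead of raising an error
    fields += [None] * (3 - len(fields))
    id_, name, licence = fields[:3]
    return (int(id_), name, licence if licence else None)
carsDF = (raw
    .filter(lambda line: line.strip() != "")   # filter empty lines
    .filter(lambda line: line != header)       # drop the header row
    .map(parseLine)
    .toDF(["id", "name", "licence"]))
carsDF.registerTempTable("cars")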
Here you go. Remember to check the delimiter for your CSV.
// create spark session
val spark = org.apache.spark.sql.SparkSession.builder
.master("local")
.appName("Spark CSV Reader")
.getOrCreate;
// read csv
val df = spark.read
.format("csv")
.option("header", "true") //reading the headers
.option("mode", "DROPMALFORMED")
.option("delimiter", ",")
.load("/your/csv/dir/simplecsv.csv")
// create a table from dataframe
df.createOrReplaceTempView("tableName")
// run your sql query
val sqlResults = spark.sql("SELECT * FROM tableName")
// display the results (display() is a Databricks notebook function; use sqlResults.show() in a plain Spark shell)
display(sqlResults)

Read ORC files directly from Spark shell

I am having issues reading an ORC file directly from the Spark shell. Note: I am running Hadoop 1.2 and Spark 1.2, using the pyspark shell, but I can also use spark-shell (which runs Scala).
I have used this resource http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.2.4/Apache_Spark_Quickstart_v224/content/ch_orc-spark-quickstart.html .
from pyspark.sql import HiveContext
hiveCtx = HiveContext(sc)
inputRead = sc.hadoopFile("hdfs://user#server:/file_path",
classOf[inputFormat:org.apache.hadoop.hive.ql.io.orc.OrcInputFormat],
classOf[outputFormat:org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat])
I generally get an error saying the syntax is wrong. One time the code seemed to work, when I used just the first of the three arguments passed to hadoopFile, but when I tried to use
inputRead.first()
the output was RDD[nothing, nothing]. I don't know if this is because the inputRead variable did not get created as an RDD or if it was not created at all.
I appreciate any help!
In Spark 1.5, I'm able to load my ORC file as:
val orcfile = "hdfs:///ORC_FILE_PATH"
val df = sqlContext.read.format("orc").load(orcfile)
df.show
You can try this code; it works for me.
val LoadOrc = spark.read.option("inferSchema", true).orc("filepath")
LoadOrc.show()
You can also pass multiple paths to read from:
val df = sqlContext.read.format("orc").load("hdfs://localhost:8020/user/aks/input1/*","hdfs://localhost:8020/aks/input2/*/part-r-*.orc")