I am new to PySpark and nothing seems to be working. Please help.
I want to read a Parquet file with PySpark. I wrote the following code:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
sqlContext.read.parquet("my_file.parquet")
I got the following error:
Py4JJavaError                             Traceback (most recent call last)
/usr/local/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:

/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    318                 "An error occurred while calling {0}{1}{2}.\n".
--> 319                 format(target_id, ".", name), value)
    320             else:
Then I tried the following code:
from pyspark.sql import SQLContext
sc = SparkContext.getOrCreate()
SQLContext.read.parquet("my_file.parquet")
Then the error was as follows:
AttributeError: 'property' object has no attribute 'parquet'
You need to create an instance of SQLContext first.
This will work from the pyspark shell:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
sqlContext.read.parquet("my_file.parquet")
If you are using spark-submit, you need to create the SparkContext yourself, in which case you would do this:
from pyspark import SparkContext
from pyspark.sql import SQLContext
sc = SparkContext()
sqlContext = SQLContext(sc)
sqlContext.read.parquet("my_file.parquet")
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# Stop the existing context (e.g. the one the pyspark shell created), then rebuild it with a local master
sc.stop()
conf = SparkConf().setMaster('local[*]')
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
df = sqlContext.read.parquet("my_file.parquet")
Try this.
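If you are on Spark 2.x or later, the same read can also go through a SparkSession, which wraps both the SparkContext and the SQLContext. A minimal sketch (the app name here is arbitrary):

from pyspark.sql import SparkSession

# Build or reuse a SparkSession; no separate SparkContext/SQLContext is needed
spark = SparkSession.builder.appName("read-parquet").getOrCreate()
df = spark.read.parquet("my_file.parquet")
df.show()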
I'm trying to read a text file using PySpark. The data in the file is comma-separated.
I've already tried reading the data using sqlContext.
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql.functions import *
sc = SparkContext._active_spark_context
fileData = './data_files/data.txt'
sqlContext = SQLContext(sc)
print(fileData)
schema = StructType([StructField('ID', IntegerType(), False),
StructField('Name', StringType(), False),
StructField('Project', StringType(), False),
StructField('Location', StringType(), False)])
print(schema)
fileRdd = sc.textFile(fileData).map(_.split(",")).map{x => org.apache.spark.sql.Row(x:_*)}
sqlDf = sqlContext.createDataFrame(fileRdd,schema)
sqlDf.show()
I'm getting the following error.
File "", line 1
fileRdd = sc.textFile(fileData).map(.split(",")).map{x => org.apache.spark.sql.Row(x:*)}
^ SyntaxError: invalid syntax
I've tried using the following code and it works fine.
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql.functions import *
sc = SparkContext("local", "first app")
sqlContext = SQLContext(sc)
filePath = "./data_files/data.txt"
# Load a text file and convert each line to a Row.
lines = sc.textFile(filePath)
parts = lines.map(lambda l: l.split(","))
# Each line is converted to a tuple.
people = parts.map(lambda p: (p[0].strip(), p[1], p[2], p[3]))
# The schema is encoded in a string.
schemaString = "ID Name Project Location"
fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)
schemaPeople = sqlContext.createDataFrame(people, schema)
schemaPeople.show()
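For reference, the line that raised the SyntaxError mixes Scala constructs (_.split(",") and map{x => ...}) into Python, which the interpreter cannot parse. A rough PySpark equivalent that keeps the original IntegerType/StringType schema might look like this (a sketch; fileData, schema and sqlContext are the ones defined in the failing snippet, and the int() cast is an assumption so the ID field matches IntegerType):

from pyspark.sql import Row

# Split each comma-separated line and cast the first field to int to match the schema
fileRdd = sc.textFile(fileData) \
    .map(lambda line: line.split(",")) \
    .map(lambda x: Row(int(x[0]), x[1].strip(), x[2].strip(), x[3].strip()))
sqlDf = sqlContext.createDataFrame(fileRdd, schema)
sqlDf.show()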
I am new to Spark and I would like to read a CSV file into a DataFrame.
Spark 1.3.0 / Scala 2.3.0
This is what I have so far:
# Start Scala with CSV Package Module
spark-shell --packages com.databricks:spark-csv_2.10:1.3.0
// Import Spark classes
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import sqlCtx._
// Create SparkConf
val conf = new SparkConf().setAppName("local").setMaster("master")
val sc = new SparkContext(conf)
// Create SQLContext
val sqlCtx = new SQLContext(sc)
// Create SparkSession and use it for all purposes:
val session = SparkSession.builder().appName("local").master("master").getOrCreate()
// Read the CSV file and turn it into a DataFrame.
val df_fc = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("/home/Desktop/test.csv")
However, at SparkSession.builder() it gives the following error:
^
How can I fix this error?
SparkSession is available in Spark 2.x. There is no need to create a SparkContext in Spark 2; the SparkSession itself provides the gateway to everything.
Since you are using version 1.x, try the following:
val df_fc = sqlCtx.read.format("com.databricks.spark.csv").option("header", "true").load("/home/Desktop/test.csv")
I am using Spark 1.4.0.
When I tried to import spark.implicits using this command:
import spark.implicits._, this error appeared:
<console>:19: error: not found: value spark
import spark.implicits._
^
Can anyone help me resolve this problem?
It's because SparkSession is available from Spark 2.0, and the spark value is an object of type SparkSession in the Spark REPL.
In Spark 1.4, use:
import sqlContext.implicits._
The sqlContext value is automatically created in the Spark REPL for Spark 1.x.
To make it complete, you first have to create a sqlContext:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
val conf = new SparkConf().setMaster("local").setAppName("my app")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
I am using Spark 1.6.1 and Scala 2.10.5. I am trying to read a CSV file through com.databricks.
While launching the spark-shell, I also use the line below:
spark-shell --packages com.databricks:spark-csv_2.10:1.5.0 --driver-class-path path to/sqljdbc4.jar
Below is the whole code:
import java.util.Properties
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
val conf = new SparkConf().setAppName("test").setMaster("local").set("spark.driver.allowMultipleContexts", "true");
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val df = SQLContext.read().format("com.databricks.spark.csv").option("inferScheme","true").option("header","true").load("path_to/data.csv");
I am getting the error below:
error: value read is not a member of object org.apache.spark.sql.SQLContext
The "^" in the error message is pointing toward "SQLContext.read().format".
I did try the suggestions available on Stack Overflow as well as on other sites, but nothing seems to be working.
Writing SQLContext accesses the object itself, i.e. only static members of the class.
You should use the sqlContext variable instead, since read is not static; it is defined on the instance.
So the code should be:
val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema","true").option("header","true").load("path_to/data.csv");
I am trying to run simple test code in PySpark that prints points using the magellan library, like the example from its GitHub repository, but I have the problem of an undefined sc context.
If I run it from the command line with the proposed command $SPARK_HOME/bin/spark-submit --packages harsha2010:magellan:1.0.2-s_2.10, everything works because sc is imported automatically, but if I run it as a standalone application from Eclipse it does not recognize sc.
I have tried all combinations for its initialization, including this piece of code:
from pyspark import SparkConf,SparkContext
from magellan.types import Point
from pyspark.sql import Row, SQLContext
#from magellan-master.python.magellan.context import sc
sc = SparkContext(appName="MyGeoFencing")
#sql = SQLContext(sc)
#from magellan.context import sc
#from magellan.context import sc
#from magellan.context import SQLContext
PointRecord = Row("id", "point")
#sparkConf = SparkConf().setAppName("MyGeoFencing")
#sc = SparkContext(conf=sparkConf)
#sql = SQLContext(sc)
sqlCont = SQLContext(sc)
points = sqlCont.parallelize([
(0, Point(-1.0, -1.0)),
(1, Point(-1.0, 1.0)),
(2, Point(1.0, -1.0))]).map(lambda x: PointRecord(*x)).toDF()
points.show()
The problem here is that sqlCont does not have a parallelize method.
I even tried importing sc directly from magellan.context, but that does not work either.
The same problem occurs when I use Scala!
Do you have some idea how this should work?
Thanks!
This works for me:
from pyspark.sql import SQLContext
from pyspark.sql.types import *

# spark here is an existing SparkSession; take its underlying SparkContext
sc = spark.sparkContext
sqlContext = SQLContext(sc)
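With sc and sqlContext set up, the parallelize call itself belongs on the SparkContext, not on the SQLContext, which is why sqlCont.parallelize fails. A sketch based on the question's own snippet, reusing its magellan Point rows (magellan still has to be available via --packages):

from pyspark.sql import Row
from magellan.types import Point

PointRecord = Row("id", "point")
# parallelize on the SparkContext, then convert the RDD of Rows to a DataFrame
points = sc.parallelize([
    (0, Point(-1.0, -1.0)),
    (1, Point(-1.0, 1.0)),
    (2, Point(1.0, -1.0))]).map(lambda x: PointRecord(*x)).toDF()
points.show()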