CSV format is not loading in spark-shell - scala

Using spark 1.6
I tried following code:
val diamonds = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/got_own/com_sep_fil.csv")
which caused the error
error: not found: value spark

In the Spark 1.6 shell you get sc of type SparkContext, not spark of type SparkSession. If you want that functionality, you will need to instantiate a SQLContext:
import org.apache.spark.sql._
val spark = new SQLContext(sc)

In the Spark 1.6 shell, sqlContext is provided as an implicit SQLContext that can be used to load a CSV file; specify com.databricks.spark.csv as the file format:
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("data.csv")
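Note that CSV support is not built into Spark 1.6; it comes from the databricks spark-csv package, so spark-shell has to be started with that package on the classpath. A sketch, where the exact package version is an assumption (pick one built for your Scala version):
# start the shell with the spark-csv package (version shown here is only an example)
spark-shell --packages com.databricks:spark-csv_2.10:1.5.0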

You need to initialize a SQLContext (Spark < 2.0) or SparkSession (Spark >= 2.0) instance to use the DataFrame methods provided by Spark.
To initialize the spark instance for Spark version < 2.0 use:
import org.apache.spark.sql._
val spark = new SQLContext(sc)
To initialize the spark instance for Spark version >= 2.0 use:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("SparkSessionExample").master("local").getOrCreate()
To read the csv using spark 1.6 and databricks spark-csv package:
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("data.csv")
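For comparison, with Spark 2.0+ the CSV source is built in, so no external package is needed. A minimal sketch, assuming the same data.csv path and the SparkSession created above:
val df = spark.read.option("header", "true").option("inferSchema", "true").csv("data.csv")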

Related

Accessing Azure Data Lake Storage gen2 from Scala

I am able to connect to ADLS Gen2 from a notebook running on Azure Databricks, but am unable to connect from a job using a jar. I used the same Spark conf settings in the Scala code as in the notebook, except for the use of dbutils.
Notebook:
spark.conf.set(
  "fs.azure.account.key.xxxx.dfs.core.windows.net",
  dbutils.secrets.get(scope = "kv-secrets", key = "xxxxxx"))
spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "true")
spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "false")
val rdd = sqlContext.read.format("csv")
  .option("header", "true")
  .load("abfss://catalogs@xxxx.dfs.core.windows.net/test/sample.csv")
// Convert rdd to data frame using toDF; the following import is
// required to use the toDF function.
val df: DataFrame = rdd.toDF()
// Write file to parquet
df.write.parquet("abfss://catalogs@xxxx.dfs.core.windows.net/test/Sales.parquet")
Scala code:
val sc = SparkContext.getOrCreate()
val spark = SparkSession.builder().getOrCreate()
sc.getConf.setAppName("Test")
sc.getConf.set("fs.azure.account.key.xxxx.dfs.core.windows.net", "<actual key>")
sc.getConf.set("fs.azure.account.auth.type", "OAuth")
sc.getConf.set("fs.azure.account.oauth.provider.type",
  "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
sc.getConf.set("fs.azure.account.oauth2.client.id", "<app id>")
sc.getConf.set("fs.azure.account.oauth2.client.secret", "<app password>")
sc.getConf.set("fs.azure.account.oauth2.client.endpoint",
  "https://login.microsoftonline.com/<tenant id>/oauth2/token")
sc.getConf.set("fs.azure.createRemoteFileSystemDuringInitialization", "false")
val sqlContext = spark.sqlContext
val rdd = sqlContext.read.format("csv")
  .option("header", "true")
  .load("abfss://catalogs@xxxx.dfs.core.windows.net/test/sample.csv")
// Convert rdd to data frame using toDF; the following import is
// required to use the toDF function.
val df: DataFrame = rdd.toDF()
println(df.count())
// Write file to parquet
df.write.parquet("abfss://catalogs@xxxx.dfs.core.windows.net/test/Sales.parquet")
I expected the parquet file to get written. Instead I get the following error:
19/04/20 13:58:40 ERROR Uncaught throwable from user code: Configuration property xxxx.dfs.core.windows.net not found.
at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AbfsConfiguration.getStorageAccountKey(AbfsConfiguration.java:385)
at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.initializeClient(AzureBlobFileSystemStore.java:802)
at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.<init>(AzureBlobFileSystemStore.java:133)
at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.initialize(AzureBlobFileSystem.java:103)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
Never mind, silly mistake. It should be:
val sc = SparkContext.getOrCreate()
val spark = SparkSession.builder().getOrCreate()
sc.getConf.setAppName("Test")
spark.conf.set("fs.azure.account.key.xxxx.dfs.core.windows.net", "<actual key>")
spark.conf.set("fs.azure.account.auth.type", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type",
  "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id", "<app id>")
spark.conf.set("fs.azure.account.oauth2.client.secret", "<app password>")
spark.conf.set("fs.azure.account.oauth2.client.endpoint",
  "https://login.microsoftonline.com/<tenant id>/oauth2/token")
spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "false")

How to fix 22: error: not found: value SparkSession in Scala?

I am new to Spark and I would like to read a CSV-file to a Dataframe.
Spark 1.3.0 / Scala 2.3.0
This is what I have so far:
# Start Scala with CSV Package Module
spark-shell --packages com.databricks:spark-csv_2.10:1.3.0
# Import Spark Classes
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import sqlCtx ._
# Create SparkConf
val conf = new SparkConf().setAppName("local").setMaster("master")
val sc = new SparkContext(conf)
# Create SQLContext
val sqlCtx = new SQLContext(sc)
# Create SparkSession and use it for all purposes:
val session = SparkSession.builder().appName("local").master("master").getOrCreate()
# Read CSV-File and turn it into Dataframe.
val df_fc = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("/home/Desktop/test.csv")
However, at SparkSession.builder() it gives the following error:
error: not found: value SparkSession
How can I fix this error?
SparkSession is only available from Spark 2.0 onwards; there is no need to create a SparkContext in Spark 2, since the SparkSession itself is the single entry point to everything.
Since you are using version 1.x, try the following instead:
val df_fc = sqlCtx.read.format("com.databricks.spark.csv").option("header", "true").load("/home/Desktop/test.csv")
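If you do upgrade to Spark 2.x, SparkSession becomes available and the same read no longer needs the databricks package. A minimal sketch, assuming the same file path (and local[*] instead of the literal "master" string from the question):
import org.apache.spark.sql.SparkSession

val session = SparkSession.builder().appName("local").master("local[*]").getOrCreate()
val df_fc = session.read.option("header", "true").csv("/home/Desktop/test.csv")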

Unable to create SparkContext object using Apache Spark 2.2 version

I use MS Windows 7.
Initially, I tried a program using Scala in Spark 1.6 and it worked fine (there I got the SparkContext object as sc automatically).
When I tried Spark 2.2, I did not get sc automatically, so I created one by doing the following steps:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
val sc = new SparkConf().setAppName("myname").setMaster("mast")
new SparkContext(sc)
Now when I try to execute the parallelize method below, it gives me an error:
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
Error:
Value parallelize is not a member of org.apache.spark.SparkConf
I followed these steps using the official documentation only. So can anybody explain to me where I went wrong? Thanks in advance. :)
If spark-shell doesn't show this line on start:
Spark context available as 'sc' (master = local[*], app id = local-XXX).
Run
val sc = SparkContext.getOrCreate()
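A quick check that the recovered context works, reusing the snippet from the question:
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
distData.collect()   // Array(1, 2, 3, 4, 5)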
The issue is that you created sc of type SparkConf, not SparkContext (the name sc suggests a context, but the value is actually a configuration object).
To use the parallelize method in Spark 2.0, or any other version, sc must be a SparkContext and not a SparkConf. The correct code should look like this:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
val sparkConf = new SparkConf().setAppName("myname").setMaster("mast")
val sc = new SparkContext(sparkConf)
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
This will give you the desired result.
You should prefer to use SparkSession, as it is the entry point for Spark from version 2 onwards. You could try something like:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder
  .master("local")
  .appName("spark session example")
  .getOrCreate()
val sc = spark.sparkContext
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
There was some problem with the 2.2.0 version of Apache Spark. I replaced it with version 2.2.1, which is the latest one, and I am able to get the sc and spark variables automatically when I start spark-shell via cmd on Windows 7. I hope it will help someone.
I executed the code below, which creates an RDD, and it works perfectly. No need to import any packages.
val dataOne = sc.parallelize(1 to 10)
dataOne.collect() // returns (and, in the shell, prints) the numbers 1 to 10 as an array
Your code should look like this:
val conf = new SparkConf()
conf.setMaster("local[*]")
conf.setAppName("myname")
val sc = new SparkContext(conf)
NOTE: the master URL should be local[*] (run locally using all available cores).

Reading TSV into Spark Dataframe with Scala API

I have been trying to get the databricks library for reading CSVs to work. I am trying to read a TSV created by Hive into a Spark DataFrame using the Scala API.
Here is an example that you can run in the spark shell (I made the sample data public so it can work for you):
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType};
val sqlContext = new SQLContext(sc)
val segments = sqlContext.read.format("com.databricks.spark.csv").load("s3n://michaeldiscenza/data/test_segments")
The documentation says you can specify the delimiter but I am unclear about how to specify that option.
All of the option parameters are passed to the option() function, as shown below:
val segments = sqlContext.read.format("com.databricks.spark.csv")
.option("delimiter", "\t")
.load("s3n://michaeldiscenza/data/test_segments")
With Spark 2.0+, use the built-in CSV connector to avoid the third-party dependency and get better performance:
val spark = SparkSession.builder.getOrCreate()
val segments = spark.read.option("sep", "\t").csv("/path/to/file")
You may also try inferSchema and check the resulting schema:
val df = spark.read.format("csv")
.option("inferSchema", "true")
.option("sep","\t")
.option("header", "true")
.load(tmp_loc)
df.printSchema()

Where is the declaration of SchemaRDD in Spark 1.3.0's API?

This code reports an error in IDEA. Why?
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD
val people = sc.textFile("c3/test.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
people.registerTempTable("people")
Is there another way to convert the RDD into a SchemaRDD, other than via import sqlContext.createSchemaRDD?
I can't find the SchemaRDD class in the Spark API documentation. Why?
SchemaRDD has been renamed to DataFrame in Apache Spark 1.3.0. See the migration guide.
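For reference, a minimal sketch of the Spark 1.3+ DataFrame equivalent of the old createSchemaRDD import, assuming the same Person case class and file path as in the question:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._   // brings toDF() into scope (replaces createSchemaRDD)

case class Person(name: String, age: Int)
val people = sc.textFile("c3/test.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
  .toDF()
people.registerTempTable("people")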