special character in dataframe spark - scala

I have a CSV file with the following content
id,pos_id,supplier_id
5127973,2000,"test
5704355,77,/10122
I want to load it into a dataframe with the data exactly as it is; the dataframe will then be loaded into PostgreSQL through JDBC.
Here is what I did:
val conf = new SparkConf().setMaster("local[2]").setAppName("my app")
val sc = new SparkContext(conf)
val sparkSession = SparkSession.builder.config(conf = conf).appName("spark session example").getOrCreate()
val df= sparkSession.sqlContext.read.format("com.databricks.spark.csv")
.option("header", "true").option("escape", "\"").load("C:\\Users\\MHT\\Desktop\\data.csv")
df.show()
+-------+------+--------------------+
|     id|pos_id|         supplier_id|
+-------+------+--------------------+
|5127973|  2000|test
5704355,77,/...|
+-------+------+--------------------+
What should I do to get the same data in the dataframe, and then the same data in PostgreSQL?

Write the CSV data to HDFS, then use Sqoop to export it to the destination database, placing the required JDBC jars in the $SQOOP_HOME/lib directory.
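Another option that stays inside Spark: the lone double quote in "test is treated as the start of a quoted field, which is why the next line gets swallowed into supplier_id. A minimal sketch, assuming Spark 2.x or later with the built-in csv source, is to turn quote handling off by setting the quote option to an empty string. The object and method names below are illustrative only.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object VerbatimCsv {
  // Reads a CSV file without giving the double quote any special meaning,
  // so a field such as "test is kept exactly as it appears in the file.
  def read(spark: SparkSession, path: String): DataFrame =
    spark.read
      .option("header", "true")
      .option("quote", "") // an empty string turns quote handling off
      .csv(path)
}
```

The resulting dataframe can then be written out with df.write.jdbc(url, table, properties), provided the PostgreSQL JDBC driver is on the classpath.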

Related

.csv not a SequenceFile error on Select Hive Query

I am quite a newbie to Spark and Scala ;)
Code summary :
Reading data from CSV files --> Creating A simple inner join on 2 Files --> Writing data to Hive table --> Submitting the job on the cluster
Can you please help me identify what went wrong?
The code is not really complex.
The job runs fine on the cluster.
However, when I try to view the data written to the Hive table, I get an error:
hive> select * from Customers limit 10;
Failed with exception java.io.IOException:java.io.IOException: hdfs://m01.itversity.com:9000/user/itv000666/warehouse/updatedcustomers.db/customers/part-00000-348a54cf-aa0c-45b4-ac49-3a881ae39702_00000.c000 .csv not a SequenceFile
object LapeyreSparkDemo extends App {
//Getting spark ready
val sparkConf = new SparkConf()
sparkConf.set("spark.app.name","Spark for Lapeyre")
//Creating Spark Session
val spark = SparkSession.builder()
.config(sparkConf)
.enableHiveSupport()
.config("spark.sql.warehouse.dir","/user/itv000666/warehouse")
.getOrCreate()
Logger.getLogger(getClass.getName).info("Spark Session Created Successfully")
//Reading
Logger.getLogger(getClass.getName).info("Data loading in DF started")
val ordersSchema = "orderid Int, customerName String, orderDate String, custId Int, orderStatus String, age String, amount Int"
val orders2019Df = spark.read
.format("csv")
.option("header",true)
.schema(ordersSchema)
.option("path","/user/itv000666/lapeyrePoc/orders2019.csv")
.load
val newOrder = orders2019Df.withColumnRenamed("custId", "oldCustId")
.withColumnRenamed("customername","oldCustomerName")
val orders2020Df = spark.read
.format("csv")
.option("header",true)
.schema(ordersSchema)
.option("path","/user/itv000666/lapeyrePoc/orders2020.csv")
.load
Logger.getLogger(getClass.getName).info("Data loading in DF complete")
//processing
Logger.getLogger(getClass.getName).info("Processing Started")
val joinCondition = newOrder.col("oldCustId") === orders2020Df.col("custId")
val joinType = "inner"
val joinData = newOrder.join(orders2020Df, joinCondition, joinType)
.select("custId","customername")
//Writing
spark.sql("create database if not exists updatedCustomers")
joinData.write
.format("csv")
.mode(SaveMode.Overwrite)
.bucketBy(4, "custId")
.sortBy("custId")
.saveAsTable("updatedCustomers.Customers")
//Stopping Spark Session
spark.stop()
}
Please let me know if more information is required.
Thanks in advance.
This was the culprit:
joinData.write
.format("csv")
Instead, I used this and it worked:
joinData.write
.format("Hive")
Since I am writing the data to a Hive table (ORC format), the format should be "Hive" and not "csv".
Also, do not forget to enable Hive support when creating the Spark session.
Also, bucketBy and sortBy were not supported for this Hive-format write in Spark 2. They may be supported in Spark 3.

CSV format is not loading in spark-shell

I am using Spark 1.6.
I tried the following code:
val diamonds = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/got_own/com_sep_fil.csv")
which caused the error
error: not found: value spark
In the Spark 1.6 shell you get sc of type SparkContext, not spark of type SparkSession. If you want that functionality, you will need to instantiate a SQLContext:
import org.apache.spark.sql._
val spark = new SQLContext(sc)
sqlContext is the SQLContext object that the Spark 1.6 shell already provides. It can be used to load the CSV file; use com.databricks.spark.csv as the format name:
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("data.csv")
You need to initialize an instance of SQLContext (Spark version < 2.0) or SparkSession (Spark version >= 2.0) to use the methods provided by Spark.
To initialize a spark instance for Spark version < 2.0, use:
import org.apache.spark.sql._
val spark = new SQLContext(sc)
To initialize a spark instance for Spark version >= 2.0, use:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("SparkSessionExample").master("local").getOrCreate()
To read the CSV using Spark 1.6 and the Databricks spark-csv package:
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("data.csv")
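The version split described above can be made explicit with a small helper. This is only a sketch: CsvFormat and forSparkVersion are hypothetical names, and the helper assumes the version string starts with a numeric major version such as "1.6.3" or "2.4.0".

```scala
object CsvFormat {
  // Returns the data source name to pass to .format(...) for a given
  // Spark version string: the built-in "csv" source exists from 2.0 on,
  // earlier versions need the external spark-csv package.
  def forSparkVersion(version: String): String = {
    val major = version.takeWhile(_ != '.').toInt
    if (major >= 2) "csv" else "com.databricks.spark.csv"
  }
}
```

In a shell session it could be used as sqlContext.read.format(CsvFormat.forSparkVersion(sc.version)).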

Accessing Azure Data Lake Storage gen2 from Scala

I am able to connect to ADLS Gen2 from a notebook running on Azure Databricks, but I am unable to connect from a job using a jar. The Scala code uses the same Spark conf settings as the notebook, save for the use of dbutils.
Notebook:
spark.conf.set(
"fs.azure.account.key.xxxx.dfs.core.windows.net",
dbutils.secrets.get(scope = "kv-secrets", key = "xxxxxx"))
spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "true")
spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "false")
val rdd = sqlContext.read.format("csv").option("header", "true")
.load("abfss://catalogs@xxxx.dfs.core.windows.net/test/sample.csv")
// Convert rdd to data frame using toDF; the following import is
// required to use toDF function.
val df: DataFrame = rdd.toDF()
// Write file to parquet
df.write.parquet("abfss://catalogs@xxxx.dfs.core.windows.net/test/Sales.parquet")
Scala code:
val sc = SparkContext.getOrCreate()
val spark = SparkSession.builder().getOrCreate()
sc.getConf.setAppName("Test")
sc.getConf.set("fs.azure.account.key.xxxx.dfs.core.windows.net",
"<actual key>")
sc.getConf.set("fs.azure.account.auth.type", "OAuth")
sc.getConf.set("fs.azure.account.oauth.provider.type",
"org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
sc.getConf.set("fs.azure.account.oauth2.client.id", "<app id>")
sc.getConf.set("fs.azure.account.oauth2.client.secret", "<app password>")
sc.getConf.set("fs.azure.account.oauth2.client.endpoint",
"https://login.microsoftonline.com/<tenant id>/oauth2/token")
sc.getConf.set("fs.azure.createRemoteFileSystemDuringInitialization", "false")
val sqlContext = spark.sqlContext
val rdd = sqlContext.read.format("csv").option("header", "true")
.load("abfss://catalogs@xxxx.dfs.core.windows.net/test/sample.csv")
// Convert rdd to data frame using toDF; the following import is
// required to use toDF function.
val df: DataFrame = rdd.toDF()
println(df.count())
// Write file to parquet
df.write.parquet("abfss://catalogs@xxxx.dfs.core.windows.net/test/Sales.parquet")
I expected the parquet file to get written. Instead I get the following error:
19/04/20 13:58:40 ERROR Uncaught throwable from user code: Configuration property xxxx.dfs.core.windows.net not found.
at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AbfsConfiguration.getStorageAccountKey(AbfsConfiguration.java:385)
at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.initializeClient(AzureBlobFileSystemStore.java:802)
at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.<init>(AzureBlobFileSystemStore.java:133)
at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.initialize(AzureBlobFileSystem.java:103)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
Never mind, silly mistake. sc.getConf returns a copy of the SparkConf, so values set on it never reach the running session; the settings have to go through spark.conf instead. It should be:
val sc = SparkContext.getOrCreate()
val spark = SparkSession.builder().getOrCreate()
sc.getConf.setAppName("Test")
spark.conf.set("fs.azure.account.key.xxxx.dfs.core.windows.net",
"<actual key>")
spark.conf.set("fs.azure.account.auth.type", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type",
"org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id", "<app id>")
spark.conf.set("fs.azure.account.oauth2.client.secret", "<app password>")
spark.conf.set("fs.azure.account.oauth2.client.endpoint",
"https://login.microsoftonline.com/<tenant id>/oauth2/token")
spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "false")
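The difference between the two versions comes down to which configuration object is being written to. The sketch below, assuming Spark 2.x or later, sets a made-up key through each route and reads it back; ConfCopyDemo and the demo.* keys are hypothetical and exist only for illustration.

```scala
import org.apache.spark.sql.SparkSession

object ConfCopyDemo {
  // Returns (value seen after sc.getConf.set, value seen after spark.conf.set)
  // for two made-up keys, to show which call actually sticks.
  def run(spark: SparkSession): (Option[String], Option[String]) = {
    val sc = spark.sparkContext

    // getConf hands back a clone, so this write lands on a throwaway object.
    sc.getConf.set("demo.via.context", "ignored")
    val viaContext = sc.getConf.getOption("demo.via.context")

    // The session's runtime config is mutable, so this value is still there afterwards.
    spark.conf.set("demo.via.session", "kept")
    val viaSession = spark.conf.getOption("demo.via.session")

    (viaContext, viaSession)
  }
}
```

On a local session, ConfCopyDemo.run(spark) should come back as (None, Some("kept")), which matches the "Configuration property ... not found" error seen when the account key was set through sc.getConf.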

Convert dataframe to hive table in spark scala

I am trying to save a dataframe as a Hive table in Spark Scala. I have read the dataframe in from an XML file, using a SQL context (here a HiveContext) to do so. I want to save this dataframe as a Hive table, but I am getting this warning:
"WARN HiveContext$$anon$1: Could not persist database_1.test_table in a Hive compatible way. Persisting it into Hive metastore in Spark SQL specific format."
object spark_conversion {
def main(args: Array[String]): Unit = {
if (args.length < 2) {
System.err.println("Usage: <input file> <output dir>")
System.exit(1)
}
val in_path = args(0)
val out_path_csv = args(1)
val conf = new SparkConf()
.setMaster("local[2]")
.setAppName("conversion")
val sc = new SparkContext(conf)
val hiveContext = new HiveContext(sc)
val df = hiveContext.read
.format("com.databricks.spark.xml")
.option("rowTag", "PolicyPeriod")
.option("attributePrefix", "attr_")
.load(in_path)
df.write
.format("com.databricks.spark.csv")
.option("header", "true")
.save(out_path_csv)
df.saveAsTable("database_1.test_table")
df.printSchema()
df.show()
}
}
In this version (I am on CDH 5.5.2), tables written by Spark's saveAsTable are not Hive compatible. Workaround from the Cloudera website:
df.registerTempTable(tempName)
hsc.sql(s"""
CREATE TABLE $tableName (
// field definitions )
STORED AS $format """)
hsc.sql(s"INSERT INTO TABLE $tableName SELECT * FROM $tempName")
http://www.cloudera.com/documentation/enterprise/release-notes/topics/cdh_rn_spark_ki.html
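To make the placeholder in the workaround concrete, the CREATE TABLE statement can be generated from a list of column names and Hive types. This is only a sketch: the HiveDdl object, its createTable method, and the example columns in the usage note are hypothetical and are not taken from the Cloudera note.

```scala
object HiveDdl {
  // Builds a CREATE TABLE statement from (column name, Hive type) pairs.
  // Column names are wrapped in backticks so reserved words do not break the DDL.
  def createTable(tableName: String, fields: Seq[(String, String)], format: String): String = {
    val columns = fields
      .map { case (name, hiveType) => s"`$name` $hiveType" }
      .mkString(", ")
    s"CREATE TABLE $tableName ($columns) STORED AS $format"
  }
}
```

For example, hsc.sql(HiveDdl.createTable("database_1.test_table", Seq("id" -> "INT", "name" -> "STRING"), "PARQUET")) would replace the first hsc.sql call, followed by the INSERT INTO statement from the workaround.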

How to convert JavaRDD<Tuple2<Object, BSONObject>> to DataFrame in Spark Mongo Connector?

I have a JavaRDD<Tuple2<Object, BSONObject>>
SparkContext sc = new SparkContext();
Configuration config = new Configuration();
config.set("mongo.input.uri", "mongodb://localhost:27017/testDB.testCollection");
JavaRDD<Tuple2<Object, BSONObject>> mongoRDD = sc.newAPIHadoopRDD(config, MongoInputFormat.class, Object.class,
BSONObject.class).toJavaRDD();
How to convert this mongoRDD to DataFrame so that I can run SQL queries on it?
With the SQLContext implicits imported (import sqlContext.implicits._ in Scala), you can call toDF on an RDD of tuples or case classes; it optionally takes the column names as arguments.
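A minimal sketch of that approach in Scala, assuming Spark 2.x or later and that each document carries a field called name (a made-up field for illustration). BSONObject itself is not a type Spark can turn into a column, so the fields of interest are first pulled out into a tuple. MongoRddToDf is a hypothetical helper; from the Java code above, the underlying Scala RDD is available as mongoRDD.rdd().

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.bson.BSONObject

object MongoRddToDf {
  // Flattens each (key, document) pair into a tuple of strings and names the columns.
  // The "name" field is hypothetical; replace it with fields your documents contain.
  def convert(spark: SparkSession, mongoRDD: RDD[(Object, BSONObject)]): DataFrame = {
    import spark.implicits._ // brings toDF into scope
    mongoRDD
      .map { case (id, doc) => (String.valueOf(id), String.valueOf(doc.get("name"))) }
      .toDF("id", "name")
  }
}
```

Once the dataframe exists, registering it with df.createOrReplaceTempView("docs") makes it available to spark.sql queries.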