Parquet file being read as empty - scala

I'm trying to read a parquet file that I downloaded from the HDFS on my Jupyter notebook however it is showing up as empty. I know it is not empty because I had worked on it prior to saving it to the HDFS. Does anyone know why it is being read as empty?
The size of the file on the HDFS and cluster environment:
hadoop fs -du -s -h /user/some/test.parquet
1.2 M 3.5 M /user/some/test.parquet
val test = spark.read.parquet("hdfs:///user/some/test.parquet")
test.count()
res0: Long = 10
On an almond-kernel in Jupyter notebook to work in Scala.
import org.apache.log4j.{Level, Logger}
Logger.getLogger("org").setLevel(Level.OFF)
import org.apache.spark.sql._
val spark = {
SparkSession.builder()
.master("local[*]")
.getOrCreate()
}
def sc = spark.sparkContext
val test = spark.read.parquet("/Users/me/some/test.parquet")
test: DataFrame = [UnitId: string, GeoId: string ... 26 more fields]
test.count()
res28: Long = 0L

For anyone who's interested, I figured out the issue.
I had downloaded the parquet file from the HDFS using "hadoop fs -getmerge", resulting in the corruption of the file.
The right approach when dealing with the parquet file would be to "hadoop fs -get.

Related

Issues with Spark and Salesforce Connection

I am trying to load in a table from SalesForce using spark. I invoked this code
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql._
object Sample {
def main(arg: Array[String]) {
val spark = SparkSession.builder().
appName("salesforce").
master("local[*]").
getOrCreate()
val tableName = "Opportunity"
val outputPath = "output/result" + tableName
val salesforceDf = spark.
read.
format("jdbc").
option("url", "jdbc:datadirect:sforce://login.salesforce.com;").
option("driver", "com.ddtek.jdbc.sforce.SForceDriver").
option("dbtable", tableName).
option("user", "").
option("password", "xxxxxxxxx").
option("securitytoken", "xxxxx")
.load()
salesforceDf.createOrReplaceTempView("Opportunity")
spark.sql("select * from Opportunity").collect.foreach(println)
//save the result
salesforceDf.write.save(outputPath)
}
}
And the docs I was referring to said to start a spark shell as:
spark-shell --jars /path_to_driver/sforce.jar
Which outputted a lot of lines in the terminal and this was the last line:
22/07/12 14:57:56 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-a12b060d-5c82-4283-b2b9-53f9b3863b53
And then to submit spark
spark-submit --jars sforce.jar --class <Your class name> your jar file
However I am not sure where this jar file is and if that was substantiated? and how to submit that. Any help is appreciated, thank you.

Spark Streaming : Write Data to HDFS by reading from one HDFSdir to another

I am trying to use spark streaming in reading data from one HDFS location to another
Below is my code snippet on spark-shell
But I couldn't see the files created on HDFS output directory
Can some point point how to load the files on HDFS
scala> sc.stop()
scala> import org.apache.spark.SparkConf
scala> import org.apache.spark.streaming
scala> import org.apache.spark.streaming.{StreamingContext,Seconds}
scala> val conf = new SparkConf().setMaster("local[2]").setAppName("files_word_count")
scala> val ssc = new StreamingContext(conf,Seconds(10))
scala> val DF = ssc.textFileStream("/user/cloudera/streamingcontext_dir")
scala> val words_freq = DF.flatMap(x=>(x.split(" "))).map(y=>(y,1)).reduceByKey(_+_)
scala> words_freq.saveAsTextFiles("hdfs://localhost:8020/user/cloudera/streamingcontext_dir2")
scala> ssc.start()
I have placed files on HDFS "/user/cloudera/streamingcontext_dir" and created another directory "/user/cloudera/streamingcontext_dir2" for seeing the files written
But I couldn't see the files in the output directory
Can someone point what's wrong here ?
Thanks
Sumit
Try making use of RDD here and not the entire DStream perhaps:
words_freq.foreachRDD(rdd =>
rdd.saveAsTextFile("hdfs://localhost:8020/user/cloudera/streamingcontext_dir2")

spark scala datastax csv load file and print schema

Spark version 2.0.2.6
Scala version 2.11.11
Using DataStax 5.0
import org.apache.log4j.{Level, Logger}
import java.util.Calendar
import org.apache.spark.sql.functions._
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import com.datastax.spark.connector._
import org.apache.spark.sql._
object csvtocassandra {
def main(args: Array[String]): Unit = {
val key_space = scala.io.StdIn.readLine("Please enter cassandra Key Space Name: ")
val table_name = scala.io.StdIn.readLine("Please enter cassandra Table Name: ")
// Cassandra Part
val conf = new SparkConf().setAppName("Sample1").setMaster("local[*]")
val sc = new SparkContext(conf)
sc.setLogLevel("ERROR")
println(Calendar.getInstance.getTime)
// Scala Read CSV Part
val spark1 = org.apache.spark.sql.SparkSession.builder().master("local").config("spark.cassandra.connection.host", "127.0.0.1")
.appName("Spark SQL basic example").getOrCreate()
val csv_input = scala.io.StdIn.readLine("Please enter csv file location: ")
val df_csv = spark1.read.format("csv").option("header", "true").option("inferschema", "true").load(csv_input)
df_csv.printSchema()
}
}
Why am I not able to run this program as a Job trying to submit it to spark. When I run this program using IntelliJ it works.
But When I create a JAR and run it I am getting following Error.
Command:
> dse spark-submit --class "csvtospark" /Users/del/target/scala-2.11/csvtospark_2.11-1.0.jar
I am getting following Error:
ERROR 2017-11-02 11:46:10,245 org.apache.spark.deploy.DseSparkSubmitBootstrapper: Failed to start or submit Spark application
org.apache.spark.sql.AnalysisException: Path does not exist: dsefs://127.0.0.1/Users/Desktop/csv/example.csv;
Why is it appending dsefs://127.0.0.1 part even though I am giving just the path /Users/Desktop/csv/example.csv when asked.
I tried giving --mater option as well. How ever I am getting the same error. I am running DataStax Spark in Local Machine. No Cluster.
Please correct me where I am doing things wrong.
Got it. Never mind. Sorry about that.
input should be file:///file_name

How to use Spark-Scala to download a CSV file from the web?

world,
How to use Spark-Scala to download a CSV file from the web and load the file into a spark-csv DataFrame?
Currently I depend on curl in a shell command to get my CSV file.
Here is the syntax I want to enhance:
/* fb_csv.scala
This script should load FB prices from Yahoo.
Demo:
spark-shell -i fb_csv.scala
*/
// I should get prices:
import sys.process._
"/usr/bin/curl -o /tmp/fb.csv http://ichart.finance.yahoo.com/table.csv?s=FB"!
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
val fb_df = sqlContext.read.format("com.databricks.spark.csv").option("header","true").option("inferSchema","true").load("/tmp/fb.csv")
fb_df.head(9)
I want to enhance the above script so it is pure Scala with no shell syntax inside.
val content = scala.io.Source.fromURL("http://ichart.finance.yahoo.com/table.csv?s=FB").mkString
val list = content.split("\n").filter(_ != "")
val rdd = sc.parallelize(list)
val df = rdd.toDF
Found better answer from Process CSV from REST API into Spark
Here you go:
import scala.io.Source._
import org.apache.spark.sql.{Dataset, SparkSession}
var res = fromURL(url).mkString.stripMargin.lines.toList
val csvData: Dataset[String] = spark.sparkContext.parallelize(res).toDS()
val frame = spark.read.option("header", true).option("inferSchema",true).csv(csvData)
frame.printSchema()

Use Spark to list all files in a Hadoop HDFS directory?

I want to loop through all text files in a Hadoop dir and count all the occurrences of the word "error". Is there a way to do a hadoop fs -ls /users/ubuntu/ to list all the files in a dir with the Apache Spark Scala API?
From the given first example, the spark context seems to only access files individually through something like:
val file = spark.textFile("hdfs://target_load_file.txt")
In my problem, I do not know how many nor the names of the files in the HDFS folder beforehand. Looked at the spark context docs but couldn't find this kind of functionality.
You can use a wildcard:
val errorCount = sc.textFile("hdfs://some-directory/*")
.flatMap(_.split(" ")).filter(_ == "error").count
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
import scala.collection.mutable.Stack
val fs = FileSystem.get( sc.hadoopConfiguration )
var dirs = Stack[String]()
val files = scala.collection.mutable.ListBuffer.empty[String]
val fs = FileSystem.get(sc.hadoopConfiguration)
dirs.push("/user/username/")
while(!dirs.isEmpty){
val status = fs.listStatus(new Path(dirs.pop()))
status.foreach(x=> if(x.isDirectory) dirs.push(x.getPath.toString) else
files+= x.getPath.toString)
}
files.foreach(println)
For a local installation, (the hdfs default path fs.defaultFS can be found by reading /etc/hadoop/core.xml):
For instance,
import org.apache.hadoop.fs.{FileSystem, Path}
val conf = sc.hadoopConfiguration
conf.set("fs.defaultFS", "hdfs://localhost:9000")
val hdfs: org.apache.hadoop.fs.FileSystem = org.apache.hadoop.fs.FileSystem.get(conf)
val fileStatus = hdfs.listStatus(new Path("hdfs://localhost:9000/foldername/"))
val fileList = fileStatus.map(x => x.getPath.toString)
fileList.foreach(println)