Spark Dataframe content can be printed out but (e.g.) not counted - scala

Strangely this doesnt work. Can someone explain the background? I want to understand why it doesnt take this.
The Inputfiles are parquet files spread across multiple folders. When I print the results, they are structured as I want to. When I use a dataframe.count() on the joined dataframe, the job will run forever. Can anyone help with the Details on that
import org.apache.spark.{SparkContext, SparkConf}
object TEST{
def main(args: Array[String] ) {
val appName = args(0)
val threadMaster = args(1)
val inputPathSent = args(2)
val inputPathClicked = args(3)
// pass spark configuration
val conf = new SparkConf()
.setMaster(threadMaster)
.setAppName(appName)
// Create a new spark context
val sc = new SparkContext(conf)
// Specify a SQL context and pass in the spark context we created
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// Create two dataframes for sent and clicked files
val dfSent = sqlContext.read.parquet(inputPathSent)
val dfClicked = sqlContext.read.parquet(inputPathClicked)
// Join them
val dfJoin = dfSent.join(dfClicked, dfSent.col("customer_id")
===dfClicked.col("customer_id") && dfSent.col("campaign_id")===
dfClicked.col("campaign_id"), "left_outer")
dfJoin.show(20) // perfectly shows the first 20 rows
dfJoin.count() //Here we run into trouble and it runs forever
}
}

Use println(dfJoin.count())
You will be able to see the count in your screen.

Related

Using iterated writing in HDFS file by using Spark/Scala

I am learning how to read and write from files in HDFS by using Spark/Scala.
I am unable to write in HDFS file, the file is created, but it's empty.
I don't know how to create a loop for writing in a file.
The code is:
import scala.collection.immutable.Map
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
// Read the adult CSV file
val logFile = "hdfs://zobbi01:9000/input/adult.csv"
val conf = new SparkConf().setAppName("Simple Application")
val sc = new SparkContext(conf)
val logData = sc.textFile(logFile, 2).cache()
//val logFile = sc.textFile("hdfs://zobbi01:9000/input/adult.csv")
val headerAndRows = logData.map(line => line.split(",").map(_.trim))
val header = headerAndRows.first
val data = headerAndRows.filter(_(0) != header(0))
val maps = data.map(splits => header.zip(splits).toMap)
val result = maps.filter(map => map("AGE") != "23")
result.foreach{
result.saveAsTextFile("hdfs://zobbi01:9000/input/test2.txt")
}
If I replace:
result.foreach{println}
Then it works!
but when using the method of (saveAsTextFile), then an error message is thrown as
<console>:76: error: type mismatch;
found : Unit
required: scala.collection.immutable.Map[String,String] => Unit
result.saveAsTextFile("hdfs://zobbi01:9000/input/test2.txt")
Any help please.
result.saveAsTextFile("hdfs://zobbi01:9000/input/test2.txt")
This is all what you need to do. You don't need to loop through all the rows.
Hope this helps!
What this does!!!
result.foreach{
result.saveAsTextFile("hdfs://zobbi01:9000/input/test2.txt")
}
RDD action cannot be triggered from RDD transformations unless special conf set.
Just use result.saveAsTextFile("hdfs://zobbi01:9000/input/test2.txt") to save to HDFS.
I f you need other formats in the file to be written, change in rdd itself before writing.

spark dataframe write to file using scala

I am trying to read a file and add two extra columns. 1. Seq no and 2. filename.
When I run spark job in scala IDE output is generated correctly but when I run in putty with local or cluster mode job is stucks at stage-2 (save at File_Process). There is no progress even i wait for an hour. I am testing on 1GB data.
Below is the code i am using
object File_Process
{
Logger.getLogger("org").setLevel(Level.ERROR)
val spark = SparkSession
.builder()
.master("yarn")
.appName("File_Process")
.getOrCreate()
def main(arg:Array[String])
{
val FileDF = spark.read
.csv("/data/sourcefile/")
val rdd = FileDF.rdd.zipWithIndex().map(indexedRow => Row.fromSeq((indexedRow._2.toLong+SEED+1)+:indexedRow._1.toSeq))
val FileDFWithSeqNo = StructType(Array(StructField("UniqueRowIdentifier",LongType)).++(FileDF.schema.fields))
val datasetnew = spark.createDataFrame(rdd,FileDFWithSeqNo)
val dataframefinal = datasetnew.withColumn("Filetag", lit(filename))
val query = dataframefinal.write
.mode("overwrite")
.format("com.databricks.spark.csv")
.option("delimiter", "|")
.save("/data/text_file/")
spark.stop()
}
If I remove logic to add seq_no, code is working fine.
code for creating seq no is
val rdd = FileDF.rdd.zipWithIndex().map(indexedRow =>Row.fromSeq((indexedRow._2.toLong+SEED+1)+:indexedRow._1.toSeq))
val FileDFWithSeqNo = StructType(Array(StructField("UniqueRowIdentifier",LongType)).++(FileDF.schema.fields))
val datasetnew = spark.createDataFrame(rdd,FileDFWithSeqNo)
Thanks in advance.

Spark-submit cannot access local file system

Really simple Scala code files at the first count() method call.
def main(args: Array[String]) {
// create Spark context with Spark configuration
val sc = new SparkContext(new SparkConf().setAppName("Spark File Count"))
val fileList = recursiveListFiles(new File("C:/data")).filter(_.isFile).map(file => file.getName())
val filesRDD = sc.parallelize(fileList)
val linesRDD = sc.textFile("file:///temp/dataset.txt")
val lines = linesRDD.count()
val files = filesRDD.count()
}
I don't want to set up a HDFS installation for this right now. How do I configure Spark to use the local file system? This works with spark-shell.
To read the file from local filesystem(From Windows directory) you need to use below pattern.
val fileRDD = sc.textFile("C:\\Users\\Sandeep\\Documents\\test\\test.txt");
Please see below sample working program to read data from local file system.
package com.scala.example
import org.apache.spark._
object Test extends Serializable {
val conf = new SparkConf().setAppName("read local file")
conf.set("spark.executor.memory", "100M")
conf.setMaster("local");
val sc = new SparkContext(conf)
val input = "C:\\Users\\Sandeep\\Documents\\test\\test.txt"
def main(args: Array[String]): Unit = {
val fileRDD = sc.textFile(input);
val counts = fileRDD.flatMap(line => line.split(","))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.collect().foreach(println)
//Stop the Spark context
sc.stop
}
}
val sc = new SparkContext(new SparkConf().setAppName("Spark File
Count")).setMaster("local[8]")
might help

Writing to a file in Apache Spark

I am writing a Scala code that requires me to write to a file in HDFS.
When I use Filewriter.write on local, it works. The same thing does not work on HDFS.
Upon checking, I found that there are the following options to write in Apache Spark-
RDD.saveAsTextFile and DataFrame.write.format.
My question is: what if I just want to write an int or string to a file in Apache Spark?
Follow up:
I need to write to an output file a header, DataFrame contents and then append some string.
Does sc.parallelize(Seq(<String>)) help?
create RDD with your data (int/string) using Seq: see parallelized-collections for details:
sc.parallelize(Seq(5)) //for writing int (5)
sc.parallelize(Seq("Test String")) // for writing string
val conf = new SparkConf().setAppName("Writing Int to File").setMaster("local")
val sc = new SparkContext(conf)
val intRdd= sc.parallelize(Seq(5))
intRdd.saveAsTextFile("out\\int\\test")
val conf = new SparkConf().setAppName("Writing string to File").setMaster("local")
val sc = new SparkContext(conf)
val stringRdd = sc.parallelize(Seq("Test String"))
stringRdd.saveAsTextFile("out\\string\\test")
Follow up Example: (Tested as below)
val conf = new SparkConf().setAppName("Total Countries having Icon").setMaster("local")
val sc = new SparkContext(conf)
val headerRDD= sc.parallelize(Seq("HEADER"))
//Replace BODY part with your DF
val bodyRDD= sc.parallelize(Seq("BODY"))
val footerRDD = sc.parallelize(Seq("FOOTER"))
//combine all rdds to final
val finalRDD = headerRDD ++ bodyRDD ++ footerRDD
//finalRDD.foreach(line => println(line))
//output to one file
finalRDD.coalesce(1, true).saveAsTextFile("test")
output:
HEADER
BODY
FOOTER
more examples here. . .

Spark and HBase Snapshots

Under the assumption that we could access data much faster if pulling directly from HDFS instead of using the HBase API, we're trying to build an RDD based on a Table Snapshot from HBase.
So, I have a snapshot called "dm_test_snap". I seem to be able to get most of the configuration stuff working, but my RDD is null (despite there being data in the Snapshot itself).
I'm having a hell of a time finding an example of anyone doing offline analysis of HBase snapshots with Spark, but I can't believe I'm alone in trying to get this working. Any help or suggestions are greatly appreciated.
Here is a snippet of my code:
object TestSnap {
def main(args: Array[String]) {
val config = ConfigFactory.load()
val hbaseRootDir = config.getString("hbase.rootdir")
val sparkConf = new SparkConf()
.setAppName("testnsnap")
.setMaster(config.getString("spark.app.master"))
.setJars(SparkContext.jarOfObject(this))
.set("spark.executor.memory", "2g")
.set("spark.default.parallelism", "160")
val sc = new SparkContext(sparkConf)
println("Creating hbase configuration")
val conf = HBaseConfiguration.create()
conf.set("hbase.rootdir", hbaseRootDir)
conf.set("hbase.zookeeper.quorum", config.getString("hbase.zookeeper.quorum"))
conf.set("zookeeper.session.timeout", config.getString("zookeeper.session.timeout"))
conf.set("hbase.TableSnapshotInputFormat.snapshot.name", "dm_test_snap")
val scan = new Scan
val job = Job.getInstance(conf)
TableSnapshotInputFormat.setInput(job, "dm_test_snap",
new Path("hdfs://nameservice1/tmp"))
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableSnapshotInputFormat],
classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
classOf[org.apache.hadoop.hbase.client.Result])
hBaseRDD.count()
System.exit(0)
}
}
Update to include the solution
The trick was, as #Holden mentioned below, the conf wasn't getting passed through. To remedy this, I was able to get it working by changing this the call to newAPIHadoopRDD to this:
val hBaseRDD = sc.newAPIHadoopRDD(job.getConfiguration, classOf[TableSnapshotInputFormat],
classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
classOf[org.apache.hadoop.hbase.client.Result])
There was a second issue that was also highlighted by #victor's answer, that I was not passing in a scan. To fix that, I added this line and method:
conf.set(TableInputFormat.SCAN, convertScanToString(scan))
def convertScanToString(scan : Scan) = {
val proto = ProtobufUtil.toScan(scan);
Base64.encodeBytes(proto.toByteArray());
}
This also let me pull out this line from the conf.set commands:
conf.set("hbase.TableSnapshotInputFormat.snapshot.name", "dm_test_snap")
*NOTE: This was for HBase version 0.96.1.1 on CDH5.0
Final full code for easy reference:
object TestSnap {
def main(args: Array[String]) {
val config = ConfigFactory.load()
val hbaseRootDir = config.getString("hbase.rootdir")
val sparkConf = new SparkConf()
.setAppName("testnsnap")
.setMaster(config.getString("spark.app.master"))
.setJars(SparkContext.jarOfObject(this))
.set("spark.executor.memory", "2g")
.set("spark.default.parallelism", "160")
val sc = new SparkContext(sparkConf)
println("Creating hbase configuration")
val conf = HBaseConfiguration.create()
conf.set("hbase.rootdir", hbaseRootDir)
conf.set("hbase.zookeeper.quorum", config.getString("hbase.zookeeper.quorum"))
conf.set("zookeeper.session.timeout", config.getString("zookeeper.session.timeout"))
val scan = new Scan
conf.set(TableInputFormat.SCAN, convertScanToString(scan))
val job = Job.getInstance(conf)
TableSnapshotInputFormat.setInput(job, "dm_test_snap",
new Path("hdfs://nameservice1/tmp"))
val hBaseRDD = sc.newAPIHadoopRDD(job.getConfiguration, classOf[TableSnapshotInputFormat],
classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
classOf[org.apache.hadoop.hbase.client.Result])
hBaseRDD.count()
System.exit(0)
}
def convertScanToString(scan : Scan) = {
val proto = ProtobufUtil.toScan(scan);
Base64.encodeBytes(proto.toByteArray());
}
}
Looking at the Job information, its making a copy of the conf object you are supplying to it (The Job makes a copy of the Configuration so that any necessary internal modifications do not reflect on the incoming parameter.) so most likely the information that you need to set on the conf object isn't getting passed down to Spark. You could instead use TableSnapshotInputFormatImpl which has a similar method that works on conf objects. There might be additional things needed but at first pass through the problem this seems like the most likely cause.
As pointed out in the comments, another option is to use job.getConfiguration to get the updated config from the job object.
You have not configured your M/R job properly:
This is example in Java on how to configure M/R over snapshots:
Job job = new Job(conf);
Scan scan = new Scan();
TableMapReduceUtil.initTableSnapshotMapperJob(snapshotName,
scan, MyTableMapper.class, MyMapKeyOutput.class,
MyMapOutputValueWritable.class, job, true);
}
You, definitely, skipped Scan. I suggest you taking a look at TableMapReduceUtil's initTableSnapshotMapperJob implementation to get idea how to configure job in Spark/Scala.
Here is complete configuration in mapreduce Java
TableMapReduceUtil.initTableSnapshotMapperJob(snapshotName, // Name of the snapshot
scan, // Scan instance to control CF and attribute selection
DefaultMapper.class, // mapper class
NullWritable.class, // mapper output key
Text.class, // mapper output value
job,
true,
restoreDir);