How to Run Apache Tika on Apache Spark - scala

I am trying to run Apache Tika on Apache Spark on AWS EMR to perform distributed text extraction on a large collection of documents. I have built the Tika JAR with shaded dependencies as explained in https://forums.databricks.com/questions/28378/trying-to-use-apache-tika-on-databricks.html, and the job works correctly in local mode. However, when running the job in cluster mode, the extracted text always comes out as an empty string. This symptom is covered in Tika's troubleshooting documentation (https://cwiki.apache.org/confluence/display/TIKA/Troubleshooting+Tika#TroubleshootingTika-NoContentExtracted), but I haven't been able to debug the issue. Since the code works for me in local mode, it has to be something with the classpath or the JARs, but I can't figure out what.
Here is sample Scala code for my Spark job:
/* TikaTest.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.tika.parser._
import org.apache.tika.sax.BodyContentHandler
import org.apache.tika.metadata.Metadata
import java.io.DataInputStream

// The first argument must be an S3 path to a directory with documents for text extraction.
// The second argument must be an S3 path to a directory where extracted text will be written.
object TikaTest {

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Tika Test")
    val sc = new SparkContext(conf)
    val binRDD = sc.binaryFiles(args(0))
    val textRDD = binRDD.map(file => parseFile(file._2.open()))
    textRDD.saveAsTextFile(args(1))
    sc.stop()
  }

  def parseFile(stream: DataInputStream): String = {
    val parser = new AutoDetectParser()
    val handler = new BodyContentHandler()
    val metadata = new Metadata()
    val context = new ParseContext()
    parser.parse(stream, handler, metadata, context)
    handler.toString()
  }
}
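One way to narrow this down (not from the original post) is to check what Tika actually sees on the executors. The troubleshooting page linked above notes that empty output usually means no parser claimed the document, and a commonly reported cause with shaded JARs is that the META-INF/services parser registrations are lost during shading. A minimal diagnostic sketch along those lines, run with the same shaded JAR and spark-submit settings as the failing job (TikaDiag is a hypothetical name):
/* TikaDiag.scala - diagnostic sketch, not part of the original job */
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.tika.config.TikaConfig
import org.apache.tika.io.TikaInputStream
import org.apache.tika.metadata.Metadata
import org.apache.tika.parser.CompositeParser

object TikaDiag {

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("Tika Diag"))
    val report = sc.binaryFiles(args(0)).map { case (path, pds) =>
      val config = TikaConfig.getDefaultConfig
      // If the shaded JAR dropped the service files, the default config has almost
      // no parsers registered and AutoDetectParser silently produces empty output.
      val parserCount = config.getParser match {
        case cp: CompositeParser => cp.getParsers.size
        case _                   => -1
      }
      val stream = TikaInputStream.get(pds.open())
      val mediaType =
        try config.getDetector.detect(stream, new Metadata())
        finally stream.close()
      s"$path -> detected=$mediaType, registeredParsers=$parserCount"
    }
    report.collect().foreach(println)
    sc.stop()
  }
}
If registeredParsers comes back very small, or everything is detected as application/octet-stream, the problem is in the packaging of the shaded JAR rather than in the Spark code itself.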

Related

spark scala datastax csv load file and print schema

Spark version 2.0.2.6
Scala version 2.11.11
Using DataStax 5.0
import org.apache.log4j.{Level, Logger}
import java.util.Calendar
import org.apache.spark.sql.functions._
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import com.datastax.spark.connector._
import org.apache.spark.sql._

object csvtocassandra {

  def main(args: Array[String]): Unit = {
    val key_space = scala.io.StdIn.readLine("Please enter cassandra Key Space Name: ")
    val table_name = scala.io.StdIn.readLine("Please enter cassandra Table Name: ")

    // Cassandra Part
    val conf = new SparkConf().setAppName("Sample1").setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.setLogLevel("ERROR")
    println(Calendar.getInstance.getTime)

    // Scala Read CSV Part
    val spark1 = org.apache.spark.sql.SparkSession.builder()
      .master("local")
      .config("spark.cassandra.connection.host", "127.0.0.1")
      .appName("Spark SQL basic example")
      .getOrCreate()
    val csv_input = scala.io.StdIn.readLine("Please enter csv file location: ")
    val df_csv = spark1.read.format("csv")
      .option("header", "true")
      .option("inferschema", "true")
      .load(csv_input)
    df_csv.printSchema()
  }
}
Why am I not able to run this program as a job when I submit it to Spark? It works when I run it from IntelliJ, but it fails when I build a JAR and submit it.
Command:
> dse spark-submit --class "csvtospark" /Users/del/target/scala-2.11/csvtospark_2.11-1.0.jar
I am getting the following error:
ERROR 2017-11-02 11:46:10,245 org.apache.spark.deploy.DseSparkSubmitBootstrapper: Failed to start or submit Spark application
org.apache.spark.sql.AnalysisException: Path does not exist: dsefs://127.0.0.1/Users/Desktop/csv/example.csv;
Why is it prepending the dsefs://127.0.0.1 part even though I am giving just the path /Users/Desktop/csv/example.csv when prompted?
I tried the --master option as well, but I still get the same error. I am running DataStax Spark on my local machine, with no cluster.
Please correct me where I am doing things wrong.
Got it, never mind, sorry about that.
The input should be file:///file_name
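Spelled out against the code above (a small sketch; the error message shows that DSE resolves bare paths against dsefs://, so a local file needs an explicit file:// scheme):
val df_csv = spark1.read.format("csv")
  .option("header", "true")
  .option("inferschema", "true")
  .load("file:///Users/Desktop/csv/example.csv")
df_csv.printSchema()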

Setting up IntelliJ to run Apache Spark with a remote master

I have a project set up with H2O. I am able to run the code with Apache Toree when I set the Spark master to spark://xxxx.yyyy.zzzz:port.
It works fine and I can see the output in the Spark UI.
I am trying to run the same code as an application in IntelliJ, but I get the error java.lang.ClassNotFoundException: org.apache.spark.h2o.utils.NodeDesc, even though I can see the application in the Spark UI for a short amount of time.
I tried running a simple hello-world application and that worked as well; I was able to see the application in the Spark UI.
import java.io.File

import hex.tree.gbm.GBM
import hex.tree.gbm.GBMModel.GBMParameters
import org.apache.spark.h2o.{StringHolder, H2OContext}
import org.apache.spark.{SparkFiles, SparkContext, SparkConf}
import water.fvec.H2OFrame

/**
 * Example of Sparkling Water based application.
 */
object SparklingWaterDroplet {

  def main(args: Array[String]) {
    // Create Spark Context
    val conf = configure("Sparkling Water Droplet")
    val sc = new SparkContext(conf)

    // Create H2O Context
    val h2oContext = H2OContext.getOrCreate(sc)
    import h2oContext.implicits._

    // Register file to be available on all nodes
    sc.addFile(this.getClass.getClassLoader.getResource("iris.csv").getPath)

    // Load data and parse it via the H2O parser
    val irisTable = new H2OFrame(new File(SparkFiles.get("iris.csv")))

    // Build GBM model
    val gbmParams = new GBMParameters()
    gbmParams._train = irisTable
    gbmParams._response_column = 'class
    gbmParams._ntrees = 5
    val gbm = new GBM(gbmParams)
    val gbmModel = gbm.trainModel.get

    // Make prediction on train data
    val predict = gbmModel.score(irisTable)('predict)

    // Compute the number of mispredictions with the help of the Spark API
    val trainRDD = h2oContext.asRDD[StringHolder](irisTable('class))
    val predictRDD = h2oContext.asRDD[StringHolder](predict)

    // Make sure that both RDDs have the same number of elements
    assert(trainRDD.count() == predictRDD.count)
    val numMispredictions = trainRDD.zip(predictRDD).filter(i => {
      val act = i._1
      val pred = i._2
      act.result != pred.result
    }).collect()

    println(
      s"""
         |Number of mispredictions: ${numMispredictions.length}
         |
         |Mispredictions:
         |
         |actual X predicted
         |------------------
         |${numMispredictions.map(i => i._1.result.get + " X " + i._2.result.get).mkString("\n")}
       """.stripMargin)

    // Shutdown the application
    sc.stop()
  }

  def configure(appName: String = "Sparkling Water Demo"): SparkConf = {
    val conf = new SparkConf().setAppName(appName)
      .setMaster("spark://xxx.yyy.zz.aaaa:oooo")
    conf
  }
}
I also tried exporting the JARs as compile scope from the dependencies menu.
Is there anything I am missing in the IntelliJ setup? It looks like the external libraries are not getting pushed to the master.
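When an application is launched straight from an IDE against a remote master, the JARs on the driver's classpath are not shipped to the workers automatically, which fits the ClassNotFoundException for a Sparkling Water class. A minimal sketch of one way to push them from inside the program, assuming the project is first packaged as an assembly/fat JAR (the JAR path below is hypothetical):
// Sketch only: point setJars at wherever your assembly JAR is actually built.
// setJars ships the listed JARs to the executors when the SparkContext starts.
def configure(appName: String = "Sparkling Water Demo"): SparkConf = {
  new SparkConf()
    .setAppName(appName)
    .setMaster("spark://xxx.yyy.zz.aaaa:oooo")
    .setJars(Seq("target/scala-2.11/sparkling-water-droplet-assembly-1.0.jar"))
}
Setting spark.jars to the same path in the configuration, or simply submitting the assembly with spark-submit, achieves the same effect.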

Apache Ignite Scala Program brings up Ignite Shell and does not progress

This very simple Apache Ignite Scala program brings up the Ignite shell and does not progress beyond the IgniteContext line; it just waits, as a typical REPL shell would. What change do I need to make so that the Ignite shell does not come up? All I want to do is store data in the Ignite cache and then read data back from the Ignite cache from within a Scala/Spark program.
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.ignite.spark._
import org.apache.ignite.configuration._

object IgniteIt {

  def main(args: Array[String]) {
    println("\n==========\nIgnite!\n==========\n")
    val cf = new SparkConf().setAppName("Ignite")
    val sc = new SparkContext(cf)
    val igniteContext = new IgniteContext(sc, "cfg/example-cache.xml")
    val cacheRdd: org.apache.ignite.spark.IgniteRDD[Int, String] = igniteContext.fromCache("partitioned")
    val data = Array((1, "One"), (2, "two"), (3, "three"), (4, "four"), (5, "five"))
    val distData = sc.parallelize(data)
    cacheRdd.savePairs(distData)
    val result = cacheRdd.filter(_._2.contains("three")).collect()
    result.foreach(println)
    igniteContext.close(false)
    println("\n==========\nDone!\n==========\n")
  }
}
I think you did not start ignite.sh before the IgniteContext was invoked.
You need to do:
cd $IGNITE_HOME
bin/ignite.sh
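Once one or more server nodes are running that way, the Spark job can attach to them as a client instead of starting nodes of its own. A rough sketch of that idea (untested against the poster's cfg/example-cache.xml, and assuming the closure-based IgniteContext constructor from the ignite-spark module):
// Sketch: join the already-running Ignite cluster in client mode.
// The cache "partitioned" must still be defined in the cluster configuration.
import org.apache.ignite.configuration.IgniteConfiguration
import org.apache.ignite.spark.IgniteContext

val igniteContext = new IgniteContext(sc, () => new IgniteConfiguration().setClientMode(true))
val cacheRdd = igniteContext.fromCache[Int, String]("partitioned")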

Running MLlib via Spark Job Server

I was practising developing a sample model using the online resources provided on the Spark website. I managed to create the model and run it on sample data using spark-shell, but how do you actually run the model in a production environment? Is it via Spark Job Server?
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("hdfs://mycluster/user/Cancer.csv")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts.last.toDouble, Vectors.dense(parts.take(9).map(_.toDouble)))
}
var svm = new SVMWithSGD().setIntercept(true)
val model = svm.run(parsedData)
var predictedValue = model.predict(Vectors.dense(5,1,1,1,2,1,3,1,1))
println(predictedValue)
The above code works perfectly when I run it in spark-shell, but I have no idea how we actually run a model in a production environment. I tried to run it via spark-jobserver, but I get an error:
curl -d "input.string = 1, 2, 3, 4, 5, 6, 7, 8, 9" 'ptfhadoop01v:8090/jobs?appName=SQL&classPath=spark.jobserver.SparkPredict'
I am sure it is because I am passing a string value whereas the program expects vector elements. Can someone guide me on how to achieve this? Also, is this how the data is passed to the model in a production environment, or is it done some other way?
Spark Job-server is used in production use-cases, where you want to design pipelines of Spark jobs, and also (optionally) use the SparkContext across jobs, over a REST API. Sparkplug is an alternative to Spark Job-server, providing similar constructs.
However, to answer your question on how to run a (singular) Spark job in production environments, the answer is that you do not need a third-party library to do so. You only need to construct a SparkContext object and use it to trigger Spark jobs. For instance, for your code snippet, all that is needed is:
package runner

import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import com.typesafe.config.{ConfigFactory, Config}
import org.apache.spark.{SparkConf, SparkContext}

object SparkRunner {

  def main(args: Array[String]) {
    val config: Config = ConfigFactory.load("app-default-config") /* Use a library to read a config file */
    val sc: SparkContext = constructSparkContext(config)

    val data = sc.textFile("hdfs://mycluster/user/Cancer.csv")
    val parsedData = data.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts.last.toDouble, Vectors.dense(parts.take(9).map(_.toDouble)))
    }
    var svm = new SVMWithSGD().setIntercept(true)
    val model = svm.run(parsedData)
    var predictedValue = model.predict(Vectors.dense(5,1,1,1,2,1,3,1,1))
    println(predictedValue)
  }

  def constructSparkContext(config: Config): SparkContext = {
    val conf = new SparkConf()
    conf
      .setMaster(config.getString("spark.master"))
      .setAppName(config.getString("app.name"))
    /* Set more configuration values here */
    new SparkContext(conf)
  }
}
Optionally, you can also use the wrapper for spark-submit script, SparkSubmit, provided in the Spark library itself.
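If you do want to go the Spark Job Server route instead, the job itself has to turn the input.string parameter into a vector before calling predict. A rough sketch, assuming the classic spark.jobserver.SparkJob API (the object name SparkPredict is taken from the curl command above, and re-training on every request is only to keep the example short):
package spark.jobserver

import com.typesafe.config.Config
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

object SparkPredict extends SparkJob {

  // Reject requests that do not carry the input.string parameter.
  override def validate(sc: SparkContext, config: Config): SparkJobValidation =
    if (config.hasPath("input.string")) SparkJobValid
    else SparkJobInvalid("No input.string config param")

  override def runJob(sc: SparkContext, config: Config): Any = {
    // Turn "5, 1, 1, ..." into the numeric features the model expects.
    val features = config.getString("input.string").split(",").map(_.trim.toDouble)

    // For illustration only: in practice you would train once and load a saved
    // model instead of re-training for every request.
    val data = sc.textFile("hdfs://mycluster/user/Cancer.csv")
    val parsedData = data.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts.last.toDouble, Vectors.dense(parts.take(9).map(_.toDouble)))
    }
    val model = new SVMWithSGD().setIntercept(true).run(parsedData)

    model.predict(Vectors.dense(features))
  }
}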

Will a standalone Scala program take advantage of distributed/parallel processing, or does Spark require separate code?

First of all, sorry for asking such a basic question here, but an explanation of the following would be appreciated.
I am very new to Scala and Spark, so my question is: if I write a standalone Scala program and execute it on Spark (1 master, 3 workers), will the program take advantage of distributed/parallel processing, or do I need to write a separate program to get the advantage of distributed processing?
For example, we have Scala code that converts files in a particular format to a comma-separated file. It takes a directory as input, parses every file, and writes the output to a single file (each input file is usually 100-200 MB). Here is the code.
import scala.io.Source
import java.io.File
import java.io.PrintWriter
import scala.collection.mutable.ListBuffer
import java.util.Calendar
//import scala.io.Source
//import org.apache.spark.SparkContext
//import org.apache.spark.SparkContext._
//import org.apache.spark.SparkConf

object Parser {

  def main(args: Array[String]) {
    //val conf = new SparkConf().setAppName("fileParsing").setMaster("local[*]")
    //val sc = new SparkContext(conf)
    var inp = new File(args(0))
    var ext: String = ""
    if (args.length == 1) { ext = "log" } else { ext = args(1) }

    var files: List[String] = List("")
    if (inp.exists && inp.isDirectory) {
      files = getListOfFiles(inp, ext)
    } else if (inp.exists) {
      files = List(inp.toString)
    } else {
      println("Enter the correct Directory/File name");
      System.exit(0);
    }

    if (files.length <= 0) {
      println(s"No file found with extension '.$ext'")
    } else {
      var out_file_name = "output_" + Calendar.getInstance().getTime.toString.replace(" ", "-").replace(":", "-") + ".log"
      var data = getHeader(files(0))
      var writer = new PrintWriter(new File(out_file_name))
      var record_count = 0
      //var allrecords = data.mkString(",")+("\n")
      //writer.write(allrecords)
      for (eachFile <- files) {
        record_count += parseFile(writer, data, eachFile)
      }
      writer.close()
      println(record_count + s" processed into $out_file_name")
    }
    //all func are defined here.
  }
Files from the specified directory are read using scala.io:
Source.fromFile(file).getLines
So my question is: can the above code (a standalone program) be executed on a distributed Spark system? Will I get the advantage of parallel processing?
OK, then how about using sc to read the files; will it then use distributed processing?
val conf = new SparkConf().setAppName("fileParsing").setMaster("local[*]")
val sc = new SparkContext(conf)
...
...
for(eachFile <- files)
{
record_count += parseFile(sc,writer,data,eachFile)
}
------------------------------------
def parseFile(......)
sc.textFile(file).getLines
So if I edit the code above to make use of sc, will it then be processed on a distributed Spark system?
No it won't. To make use of distributed computing using Spark, you need to use SparkContext.
If you run the application you have provided using spark-submit you will not be using the Spark cluster at all. You have to rewrite it to use the SparkContext. Please read through the Spark Programming Guide.
It is extremely helpful to watch some introductory videos on YouTube to get to know how Apache Spark works in general.
For example, these:
https://www.youtube.com/watch?v=7k4yDKBYOcw
https://www.youtube.com/watch?v=rvDpBTV89AM&list=PLF6snu5Jy-v-WRAcCfWNHks7lcNO-zrTI&index=4
It is very important to understand this in order to use Spark.
"advantage of distributed processing"
Using Spark can give you the advantage of distributing processing across a multi-server cluster. So if you are going to move your application to a cluster later, it makes sense to develop the application using the Spark model and the corresponding API.
You can run a Spark application locally on your own machine, but in that case you won't get all the advantages that Spark can provide.
Anyway, as was said before, Spark is a framework with its own libraries for development. So you have to rewrite your application using the SparkContext and the Spark API, i.e. objects like RDDs or DataFrames and their corresponding methods.
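For instance, here is a minimal sketch of how the directory-to-CSV job above might look against the RDD API. The real field extraction lives in the original parseFile/getHeader helpers, which are not shown in the question, so parseLine below is only a placeholder:
import org.apache.spark.{SparkConf, SparkContext}

object DistributedParser {

  // Placeholder for the real per-line conversion logic from the original parseFile.
  def parseLine(line: String): String = line

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("fileParsing")
    val sc = new SparkContext(conf)

    // textFile reads every matching file as a distributed RDD of lines, so the
    // map below runs in parallel across the workers instead of in one local loop.
    val lines = sc.textFile(args(0) + "/*.log")
    val parsed = lines.map(parseLine)

    // Output is written as part-files, one per partition, rather than a single file.
    parsed.saveAsTextFile(args(1))
    sc.stop()
  }
}
Submitted with spark-submit against the cluster master, this version distributes both the reading and the per-line transformation, which the scala.io.Source version cannot do.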