Spark JSON DStream Print() / saveAsTextFiles not working - scala

Issue Description:
Spark Version: 1.6.2
Execution: Spark-shell (REPL) master = local[2] (tried local[*])
example.json is as below:
{"name":"D2" ,"lovesPandas":"Y"}
{"name":"D3" ,"lovesPandas":"Y"}
{"name":"D4" ,"lovesPandas":"Y"}
{"name":"D5" ,"lovesPandas":"Y"}
Code executing in Spark-shell local mode:
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.kafka._
import org.apache.spark.sql._
import org.json4s._
import org.json4s.jackson.JsonMethods._
import _root_.kafka.serializer.StringDecoder
import _root_.kafka.serializer.Decoder
import _root_.kafka.utils.VerifiableProperties
import org.apache.hadoop.hbase._
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapred.JobConf
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext
val ssc = new StreamingContext(sc, Seconds(2) )
val messages = ssc.textFileStream("C:\\pdtemp\\test\\example.json")
messages.print()
This shows no output. I also tried saveAsTextFiles, but it does not save any files either. I see the same behavior when reading the stream from Kafka in spark-shell.
I also tried the following, which does not work either:
messages.foreachRDD(rdd => rdd.foreach(print))
I also tried parsing the schema and converting to a DataFrame, but nothing seems to work.
Ordinary JSON parsing works, and I can print the contents of a regular RDD/DataFrame to the console in spark-shell.
Can anyone help, please?
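One likely cause, offered as an assumption rather than a confirmed diagnosis: textFileStream expects a directory to monitor rather than a single file, it only picks up files that appear in that directory after the stream has started, and no output operation runs until ssc.start() is called, which the snippet above never does. A minimal sketch for spark-shell, assuming `sc` is the shell's SparkContext; `startFileStream` is a hypothetical helper name:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical helper: wires up a file stream on a directory and starts it.
def startFileStream(sc: SparkContext, dir: String): StreamingContext = {
  val ssc = new StreamingContext(sc, Seconds(2))
  val messages = ssc.textFileStream(dir) // a directory to watch, not a single file
  messages.print()                        // registers an output operation
  ssc.start()                             // nothing is processed before start()
  ssc
}

// Assumption: C:\pdtemp\test is the directory that will receive new files.
val ssc = startFileStream(sc, "C:\\pdtemp\\test")

// While this waits, copy or move example.json into the directory;
// files that were already there before start() are ignored.
ssc.awaitTerminationOrTimeout(10000)
ssc.stop(stopSparkContext = false)
```

Moving a finished file into the watched directory is safer than writing it in place, because the stream may otherwise read a partially written file.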

Related

Packaging scala class on databricks (error: not found: value dbutils)

Trying to make a package with a class
package x.y.Log
import scala.collection.mutable.ListBuffer
import org.apache.spark.sql.{DataFrame}
import org.apache.spark.sql.functions.{lit, explode, collect_list, struct}
import org.apache.spark.sql.types.{StructField, StructType}
import java.util.Calendar
import java.text.SimpleDateFormat
import org.apache.spark.sql.functions._
import spark.implicits._
class Log{
...
}
Everything runs fine in the same notebook, but once I try to create a package that I can use in other notebooks, I get these errors:
<notebook>:11: error: not found: object spark
import spark.implicits._
^
<notebook>:21: error: not found: value dbutils
val notebookPath = dbutils.notebook.getContext().notebookPath.get
^
<notebook>:22: error: not found: value dbutils
val userName = dbutils.notebook.getContext.tags("user")
^
<notebook>:23: error: not found: value dbutils
val userId = dbutils.notebook.getContext.tags("userId")
^
<notebook>:41: error: not found: value spark
var rawMeta = spark.read.format("json").option("multiLine", true).load("/FileStore/tables/xxx.json")
^
<notebook>:42: error: value $ is not a member of StringContext
.filter($"Name".isin(readSources))
Does anyone know how to package this class with these libraries?
Assuming you are running Spark 2.x, the statement import spark.implicits._ only works when a SparkSession value named spark is in scope. The object implicits is defined inside the SparkSession class and extends SQLImplicits from earlier versions of Spark; the SparkSession source on GitHub confirms this. Create the session inside the class before the import:
package x.y.Log
import scala.collection.mutable.ListBuffer
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{lit, explode, collect_list, struct}
import org.apache.spark.sql.types.{StructField, StructType}
import java.util.Calendar
import java.text.SimpleDateFormat
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession
class Log{
val spark: SparkSession = SparkSession.builder.enableHiveSupport().getOrCreate()
import spark.implicits._
...[rest of the code below]
}
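An alternative to building the session inside the class is to pass it in from the calling notebook. A minimal sketch, assuming Spark 2.x; the constructor parameter and the `filterByName` method are hypothetical and not part of the original class:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical variant of Log that receives the session from the caller.
class Log(spark: SparkSession) {
  // A constructor parameter is a stable identifier, so this import compiles
  // and brings the $"..." column syntax into scope.
  import spark.implicits._

  // Keeps only the rows whose Name column matches one of the given names.
  def filterByName(df: DataFrame, names: Seq[String]): DataFrame =
    df.filter($"Name".isin(names: _*)) // isin takes varargs, so expand the Seq
}
```

From a notebook this would be used as new Log(spark), where spark is the session the notebook already provides.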

Returns Null when reading data from XML

I am trying to parse data from an XML file in Spark using the Databricks spark-xml library.
Here is my code:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions
import java.text.Format
import org.apache.spark.sql.functions.concat_ws
import org.apache.spark.sql
import org.apache.spark.sql.types._
import org.apache.spark.sql.catalyst.plans.logical.With
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.functions.udf
import scala.sys.process._
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.functions._
object printschema
{
def main(args: Array[String]): Unit =
{
val conf = new SparkConf().setAppName("printschema").setMaster("local")
conf.set("spark.debug.maxToStringFields", "10000000")
val context = new SparkContext(conf)
val sqlCotext = new SQLContext(context)
import sqlCotext.implicits._
val df = sqlCotext.read.format("com.databricks.spark.xml")
.option("rowTag", "us-bibliographic-data-application")
.option("treatEmptyValuesAsNulls", true)
.load("/Users/praveen/Desktop/ipa0105.xml")
val q1= df.withColumn("document",$"application-reference.document-id.doc-number".cast(sql.types.StringType))
.withColumn("document_number",$"application-reference.document-id.doc-number".cast(sql.types.StringType)).select("document","document_number").collect()
for(l<-q1)
{
val m1=l.get(0)
val m2=l.get(1)
println(m1,m2)
}
}
}
When I run the code in Scala IDE or IntelliJ IDEA, it works fine. Here is my output:
(14789882,14789882)
(14755945,14755945)
(14755919,14755919)
But when I build a jar and run it with spark-submit, it returns only null values.
OUTPUT :
NULL,NULL
NULL,NULL
NULL,NULL
Here is my spark-submit command:
./spark-submit --jars /home/hadoop/spark-xml_2.11-0.4.0.jar --class inndata.praveen --master local[2] /home/hadoop/ip/target/scala-2.11/ip_2.11-1.0.jar

Spark.sql and sqlContext.sql

I have imported the modules below. When I try to load data with sqlCtx.read.format, I get the error IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState':", but it works well when I use spark.read.format. I see the same behavior when retrieving data from a registered temp table/view. What do I need to add so that I can use sqlCtx.sql instead of spark.sql?
import os
import sys
import pandas as pd
import odbc as pyodbc
import os
import sys
import re
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql import Row
from pyspark.sql.functions import *
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import pyspark.sql.functions as func
import matplotlib.patches as mpatches
import time as time
from matplotlib.patches import Rectangle
import datetime
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
conf = SparkConf()
conf.setMaster("local")
conf.setAppName("AppName")
sqlCtx = SQLContext(sc)
I spent two hours of my life on this one, only to realize I did not need:
sqlCtx = SQLContext(sc)
Just using SQLContext.read.(...) solved this in my case.

Why recommendProductsForUsers is not a member of org.apache.spark.mllib.recommendation.MatrixFactorizationModel

I have built a recommendation system using Spark MLlib with ALS collaborative filtering.
My code snippet:
bestModel.get
.predict(toBePredictedBroadcasted.value)
Everything works, but I need to change the code to meet a new requirement. From the Scaladoc,
I understand that I need to use def recommendProducts.
But when I tried this in my code:
bestModel.get.recommendProductsForUsers(100)
I get this error at compile time:
value recommendProductsForUsers is not a member of org.apache.spark.mllib.recommendation.MatrixFactorizationModel
[error] bestModel.get.recommendProductsForUsers(100)
Can anyone help me?
Thanks.
NB: I use Spark 1.5.0.
My imports:
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._
import java.io.File
import scala.io.Source
import org.apache.log4j.Logger
import org.apache.log4j.Level
import org.apache.spark.rdd._
import org.apache.spark.mllib.recommendation.{ALS, Rating, MatrixFactorizationModel}
import org.apache.spark.sql.SQLContext
import org.apache.spark.broadcast.Broadcast

How can I load Avros in Spark using the schema on-board the Avro file(s)?

I am running CDH 4.4 with Spark 0.9.0 from a Cloudera parcel.
I have a bunch of Avro files that were created via Pig's AvroStorage UDF. I want to load these files in Spark, using a generic record or the schema onboard the Avro files. So far I've tried this:
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable
import org.apache.commons.lang.StringEscapeUtils.escapeCsv
import org.apache.hadoop.fs.Path
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.conf.Configuration
import java.net.URI
import java.io.BufferedInputStream
import java.io.File
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.specific.SpecificDatumReader
import org.apache.avro.file.DataFileStream
import org.apache.avro.io.DatumReader
import org.apache.avro.file.DataFileReader
import org.apache.avro.mapred.FsInput
val input = "hdfs://hivecluster2/securityx/web_proxy_mef/2014/05/29/22/part-m-00016.avro"
val inURI = new URI(input)
val inPath = new Path(inURI)
val fsInput = new FsInput(inPath, sc.hadoopConfiguration)
val reader = new GenericDatumReader[GenericRecord]
val dataFileReader = DataFileReader.openReader(fsInput, reader)
val schemaString = dataFileReader.getSchema
val buf = scala.collection.mutable.ListBuffer.empty[GenericRecord]
while(dataFileReader.hasNext) {
buf += dataFileReader.next
}
sc.parallelize(buf)
This works for one file, but it can't scale - I am loading all the data into local RAM and then distributing it across the spark nodes from there.
To answer my own question:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapred.AvroInputFormat
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable
import org.apache.commons.lang.StringEscapeUtils.escapeCsv
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
import org.apache.hadoop.conf.Configuration
import java.io.BufferedInputStream
import org.apache.avro.file.DataFileStream
import org.apache.avro.io.DatumReader
import org.apache.avro.file.DataFileReader
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.mapred.FsInput
import org.apache.avro.Schema
import org.apache.avro.Schema.Parser
import org.apache.hadoop.mapred.JobConf
import java.io.File
import java.net.URI
// spark-shell -usejavacp -classpath "*.jar"
val input = "hdfs://hivecluster2/securityx/web_proxy_mef/2014/05/29/22/part-m-00016.avro"
val jobConf= new JobConf(sc.hadoopConfiguration)
val rdd = sc.hadoopFile(
input,
classOf[org.apache.avro.mapred.AvroInputFormat[GenericRecord]],
classOf[org.apache.avro.mapred.AvroWrapper[GenericRecord]],
classOf[org.apache.hadoop.io.NullWritable],
10)
val f1 = rdd.first
val a = f1._1.datum
a.get("rawLog") // Access avro fields
This works for me:
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
import org.apache.hadoop.io.NullWritable
...
val path = "hdfs:///path/to/your/avro/folder"
val avroRDD = sc.hadoopFile[AvroWrapper[GenericRecord], NullWritable, AvroInputFormat[GenericRecord]](path)
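Both answers return an RDD of (AvroWrapper, NullWritable) pairs. Hadoop input formats may reuse the wrapper object between records, so it is usually safer to copy the values you need out of each record before caching or collecting. A minimal sketch under that assumption; `readField` is a hypothetical helper, and the field name rawLog is taken from the earlier answer:

```scala
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
import org.apache.hadoop.io.NullWritable
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical helper: reads one field from every Avro record as a String.
// The schema comes from the Avro files themselves, so none is passed in.
def readField(sc: SparkContext, path: String, field: String): RDD[String] =
  sc.hadoopFile[AvroWrapper[GenericRecord], NullWritable, AvroInputFormat[GenericRecord]](path)
    .map { case (wrapper, _) =>
      // Copy the value out immediately; a missing field becomes the string "null".
      String.valueOf(wrapper.datum.get(field))
    }

// Usage in spark-shell (path is the placeholder from the answer above):
// val rawLogs = readField(sc, "hdfs:///path/to/your/avro/folder", "rawLog")
// rawLogs.take(5).foreach(println)
```

Mapping to plain strings also sidesteps serialization problems when records are shuffled or collected, since Avro's generic record classes are not java.io.Serializable.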