How to define multiple custom delimiters for an input file in Spark? - scala

The default record delimiter when reading a file via Spark is the newline character (\n). It is possible to define a custom delimiter by setting the "textinputformat.record.delimiter" property.
But is it possible to specify multiple delimiters for the same file?
Suppose a file has following content :
COMMENT,A,B,C
COMMENT,D,E,
F
LIKE,I,H,G
COMMENT,J,K,
L
COMMENT,M,N,O
I want to read this file with COMMENT and LIKE as delimiters instead of the newline character.
I did come up with an alternative in case multiple delimiters are not allowed in Spark:
val ss = SparkSession.builder().appName("SentimentAnalysis").master("local[*]").getOrCreate()
val sc = ss.sparkContext
sc.hadoopConfiguration.set("textinputformat.record.delimiter", "COMMENT")
val rdd = sc.textFile("<filepath>")
val finalRdd = rdd.flatMap(f => f.split("LIKE"))
Still, I think it would be better to have multiple custom delimiters. Is that possible in Spark, or do I have to use the above alternative?

Solved the above issue by creating a custom TextInputFormat class that splits on two kinds of delimiter strings. The post pointed to by @puhlen in the comments was a great help. Below is the code snippet which I used:
import java.io.IOException

import org.apache.hadoop.fs.FSDataInputStream
import org.apache.hadoop.io.{DataOutputBuffer, LongWritable, Text}
import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.{FileSplit, TextInputFormat}

import scala.collection.mutable.MutableList

class CustomInputFormat extends TextInputFormat {
  override def createRecordReader(inputSplit: InputSplit, taskAttemptContext: TaskAttemptContext): RecordReader[LongWritable, Text] = {
    new ParagraphRecordReader()
  }
}

class ParagraphRecordReader extends RecordReader[LongWritable, Text] {
  var end: Long = 0L
  var stillInChunk = true
  var key = new LongWritable()
  var value = new Text()
  var fsin: FSDataInputStream = null
  val buffer = new DataOutputBuffer()
  val tempBuffer1 = MutableList[Int]()
  val tempBuffer2 = MutableList[Int]()
  // The two delimiter strings this reader splits on.
  val endTag1 = "COMMENT".getBytes()
  val endTag2 = "LIKE".getBytes()

  @throws(classOf[IOException])
  @throws(classOf[InterruptedException])
  override def initialize(inputSplit: InputSplit, taskAttemptContext: TaskAttemptContext) {
    val split = inputSplit.asInstanceOf[FileSplit]
    val conf = taskAttemptContext.getConfiguration()
    val path = split.getPath()
    val fs = path.getFileSystem(conf)
    fsin = fs.open(path)
    val start = split.getStart()
    end = split.getStart() + split.getLength()
    fsin.seek(start)
    // If this is not the first split, skip ahead to the first delimiter;
    // the partial record at the start belongs to the previous split.
    if (start != 0) {
      readUntilMatch(endTag1, endTag2, false)
    }
  }

  @throws(classOf[IOException])
  override def nextKeyValue(): Boolean = {
    if (!stillInChunk) return false
    val status = readUntilMatch(endTag1, endTag2, true)
    value = new Text()
    value.set(buffer.getData(), 0, buffer.getLength())
    key = new LongWritable(fsin.getPos())
    buffer.reset()
    if (!status) {
      stillInChunk = false
    }
    true
  }

  @throws(classOf[IOException])
  @throws(classOf[InterruptedException])
  override def getCurrentKey(): LongWritable = key

  @throws(classOf[IOException])
  @throws(classOf[InterruptedException])
  override def getCurrentValue(): Text = value

  @throws(classOf[IOException])
  @throws(classOf[InterruptedException])
  override def getProgress(): Float = 0.0f

  @throws(classOf[IOException])
  override def close() {
    fsin.close()
  }

  /** Reads bytes until either delimiter is fully matched; returns false at end of stream. */
  @throws(classOf[IOException])
  def readUntilMatch(match1: Array[Byte], match2: Array[Byte], withinBlock: Boolean): Boolean = {
    var i = 0
    var j = 0
    while (true) {
      val b = fsin.read()
      if (b == -1) return false
      if (b == match1(i)) {
        tempBuffer1 += b
        i = i + 1
        if (i >= match1.length) {
          tempBuffer1.clear()
          return fsin.getPos() < end
        }
      } else if (b == match2(j)) {
        tempBuffer2 += b
        j = j + 1
        if (j >= match2.length) {
          tempBuffer2.clear()
          return fsin.getPos() < end
        }
      } else {
        // A partial delimiter match turned out to be ordinary data: flush it to the record.
        if (tempBuffer1.size != 0)
          tempBuffer1.foreach { x => if (withinBlock) buffer.write(x) }
        else if (tempBuffer2.size != 0)
          tempBuffer2.foreach { x => if (withinBlock) buffer.write(x) }
        tempBuffer1.clear()
        tempBuffer2.clear()
        if (withinBlock) buffer.write(b)
        i = 0
        j = 0
      }
    }
    false
  }
}
Use this class when reading the file from the filesystem and it will be read with the two delimiters as required. :)
val rdd = sc.newAPIHadoopFile("<filepath>", classOf[CustomInputFormat], classOf[LongWritable], classOf[Text], sc.hadoopConfiguration)
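The result is an RDD of (LongWritable, Text) pairs keyed by byte offset. If you only need the record text, here is a minimal follow-up sketch (the trim and the empty-record filter are just assumptions about how you might want to clean the output):
// Keep only the record text; drop the byte-offset keys and skip empty records.
val records = rdd.map { case (_, text) => text.toString.trim }.filter(_.nonEmpty)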

Related

why mapPartitions does not see my val - SCALA/SPARK?

I define a val like this:
val config = Config(args)
val product_type = config.product_type
Then I pass product_type in as "AA", and my code is this:
val scores = df.mapPartitions(iterator => {
val inputStream =
if(product_type == "AA" ) {
getClass().getClassLoader().getResourceAsStream("my_aa.hdf5")
}
else {
getClass().getClassLoader().getResourceAsStream("my_bb.hdf5")
}
val multiLayerNetwork: MultiLayerNetwork = KerasModelImport.importKerasSequentialModelAndWeights(inputStream, false)
val wrapped: ParallelInference = new ParallelInference.Builder(multiLayerNetwork).build()
val res = iterator.map(row => {
wrapped.output(row).toDoubleVector
})
res
})
But my inputStream ends up as "my_bb.hdf5", which is not correct; this value comes from the else branch. So why can't my product_type variable be read inside mapPartitions?
I print my product_type value before this code and checked it: it is "AA".
It seems this happens because I get the variable from an argument in spark-submit.sh, and it cannot be read from inside mapPartitions.
It works like this:
val scores =
if (product_type == "AA") {
df.mapPartitions(iterator => {
val inputStream = getClass().getClassLoader().getResourceAsStream("AA.hdf5")
val multiLayerNetwork: MultiLayerNetwork = KerasModelImport.importKerasSequentialModelAndWeights(inputStream, false)
val wrapped: ParallelInference = new ParallelInference.Builder(multiLayerNetwork).build()
val res = iterator.map(row => {
wrapped.output(row).toDoubleVector
})
res
})
} else {
df.mapPartitions(iterator => {
val inputStream = getClass().getClassLoader().getResourceAsStream("BB.hdf5")
val multiLayerNetwork: MultiLayerNetwork = KerasModelImport.importKerasSequentialModelAndWeights(inputStream, false)
val wrapped: ParallelInference = new ParallelInference.Builder(multiLayerNetwork).build()
val res = iterator.map(row => {
wrapped.output(row).toDoubleVector
})
res
})
}
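An alternative sketch that avoids duplicating the whole mapPartitions body, assuming the underlying issue is that the closure captures the config/driver object rather than just the string: copy the decision into a plain local val outside the closure (the resource file names are the ones from the question).
// Hypothetical rework: only the small local string is captured by the closure.
val resourceName = if (product_type == "AA") "my_aa.hdf5" else "my_bb.hdf5"
val scores = df.mapPartitions(iterator => {
  val inputStream = getClass().getClassLoader().getResourceAsStream(resourceName)
  val multiLayerNetwork: MultiLayerNetwork = KerasModelImport.importKerasSequentialModelAndWeights(inputStream, false)
  val wrapped: ParallelInference = new ParallelInference.Builder(multiLayerNetwork).build()
  iterator.map(row => wrapped.output(row).toDoubleVector)
})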

Scala code generates type mismatch (ScalaPB)

I have a protobuf file which I transform to a Scala file using ScalaPB, so that I can use it inside my Jupyter notebook* for transformation. Sadly, when I run the specific cell I get a type mismatch error and I don't know why.
As the protobuf file works with Python and the Scala code is generated, what is not right here? Could this be a bug?
*The notebook uses com.google.protobuf:protobuf-java:3.5.0,com.thesamet.scalapb:sparksql-scalapb_2.11:0.7.0 as imports
Sources & Error
protobuf file:
syntax = "proto2";
import "scalapb/scalapb.proto";
option (scalapb.options) = {
flat_package: true
single_file: true
};
message JSONEntry {
required uint64 ts = 1;
required string data = 2;
}
message JSONOutput {
optional string metadata = 1;
repeated JSONEntry entry = 2;
}
Scala (generated) code
// Generated by the Scala Plugin for the Protocol Buffer Compiler.
// Do not edit!
//
// Protofile syntax: PROTO2
@SerialVersionUID(0L)
final case class JSONEntry(
ts: _root_.scala.Long,
data: _root_.scala.Predef.String
) extends scalapb.GeneratedMessage with scalapb.Message[JSONEntry] with scalapb.lenses.Updatable[JSONEntry] {
@transient
private[this] var __serializedSizeCachedValue: _root_.scala.Int = 0
private[this] def __computeSerializedValue(): _root_.scala.Int = {
var __size = 0
__size += _root_.com.google.protobuf.CodedOutputStream.computeUInt64Size(1, ts)
__size += _root_.com.google.protobuf.CodedOutputStream.computeStringSize(2, data)
__size
}
final override def serializedSize: _root_.scala.Int = {
var read = __serializedSizeCachedValue
if (read == 0) {
read = __computeSerializedValue()
__serializedSizeCachedValue = read
}
read
}
def writeTo(`_output__`: _root_.com.google.protobuf.CodedOutputStream): Unit = {
_output__.writeUInt64(1, ts)
_output__.writeString(2, data)
}
def mergeFrom(`_input__`: _root_.com.google.protobuf.CodedInputStream): JSONEntry = {
var __ts = this.ts
var __data = this.data
var __requiredFields0: _root_.scala.Long = 0x3L
var _done__ = false
while (!_done__) {
val _tag__ = _input__.readTag()
_tag__ match {
case 0 => _done__ = true
case 8 =>
__ts = _input__.readUInt64()
__requiredFields0 &= 0xfffffffffffffffeL
case 18 =>
__data = _input__.readString()
__requiredFields0 &= 0xfffffffffffffffdL
case tag => _input__.skipField(tag)
}
}
if (__requiredFields0 != 0L) { throw new _root_.com.google.protobuf.InvalidProtocolBufferException("Message missing required fields.") }
JSONEntry(
ts = __ts,
data = __data
)
}
def withTs(__v: _root_.scala.Long): JSONEntry = copy(ts = __v)
def withData(__v: _root_.scala.Predef.String): JSONEntry = copy(data = __v)
def getFieldByNumber(__fieldNumber: _root_.scala.Int): scala.Any = {
(__fieldNumber: @_root_.scala.unchecked) match {
case 1 => ts
case 2 => data
}
}
def getField(__field: _root_.scalapb.descriptors.FieldDescriptor): _root_.scalapb.descriptors.PValue = {
require(__field.containingMessage eq companion.scalaDescriptor)
(__field.number: @_root_.scala.unchecked) match {
case 1 => _root_.scalapb.descriptors.PLong(ts)
case 2 => _root_.scalapb.descriptors.PString(data)
}
}
def toProtoString: _root_.scala.Predef.String = _root_.scalapb.TextFormat.printToUnicodeString(this)
def companion = JSONEntry
}
object JSONEntry extends scalapb.GeneratedMessageCompanion[JSONEntry] {
implicit def messageCompanion: scalapb.GeneratedMessageCompanion[JSONEntry] = this
def fromFieldsMap(__fieldsMap: scala.collection.immutable.Map[_root_.com.google.protobuf.Descriptors.FieldDescriptor, scala.Any]): JSONEntry = {
require(__fieldsMap.keys.forall(_.getContainingType() == javaDescriptor), "FieldDescriptor does not match message type.")
val __fields = javaDescriptor.getFields
JSONEntry(
__fieldsMap(__fields.get(0)).asInstanceOf[_root_.scala.Long],
__fieldsMap(__fields.get(1)).asInstanceOf[_root_.scala.Predef.String]
)
}
implicit def messageReads: _root_.scalapb.descriptors.Reads[JSONEntry] = _root_.scalapb.descriptors.Reads{
case _root_.scalapb.descriptors.PMessage(__fieldsMap) =>
require(__fieldsMap.keys.forall(_.containingMessage == scalaDescriptor), "FieldDescriptor does not match message type.")
JSONEntry(
__fieldsMap.get(scalaDescriptor.findFieldByNumber(1).get).get.as[_root_.scala.Long],
__fieldsMap.get(scalaDescriptor.findFieldByNumber(2).get).get.as[_root_.scala.Predef.String]
)
case _ => throw new RuntimeException("Expected PMessage")
}
def javaDescriptor: _root_.com.google.protobuf.Descriptors.Descriptor = DataProto.javaDescriptor.getMessageTypes.get(0)
def scalaDescriptor: _root_.scalapb.descriptors.Descriptor = DataProto.scalaDescriptor.messages(0)
def messageCompanionForFieldNumber(__number: _root_.scala.Int): _root_.scalapb.GeneratedMessageCompanion[_] = throw new MatchError(__number)
lazy val nestedMessagesCompanions: Seq[_root_.scalapb.GeneratedMessageCompanion[_]] = Seq.empty
def enumCompanionForFieldNumber(__fieldNumber: _root_.scala.Int): _root_.scalapb.GeneratedEnumCompanion[_] = throw new MatchError(__fieldNumber)
lazy val defaultInstance = JSONEntry(
ts = 0L,
data = ""
)
implicit class JSONEntryLens[UpperPB](_l: _root_.scalapb.lenses.Lens[UpperPB, JSONEntry]) extends _root_.scalapb.lenses.ObjectLens[UpperPB, JSONEntry](_l) {
def ts: _root_.scalapb.lenses.Lens[UpperPB, _root_.scala.Long] = field(_.ts)((c_, f_) => c_.copy(ts = f_))
def data: _root_.scalapb.lenses.Lens[UpperPB, _root_.scala.Predef.String] = field(_.data)((c_, f_) => c_.copy(data = f_))
}
final val TS_FIELD_NUMBER = 1
final val DATA_FIELD_NUMBER = 2
}
@SerialVersionUID(0L)
final case class JSONOutput(
metadata: scala.Option[_root_.scala.Predef.String] = None,
entry: _root_.scala.collection.Seq[JSONEntry] = _root_.scala.collection.Seq.empty
) extends scalapb.GeneratedMessage with scalapb.Message[JSONOutput] with scalapb.lenses.Updatable[JSONOutput] {
@transient
private[this] var __serializedSizeCachedValue: _root_.scala.Int = 0
private[this] def __computeSerializedValue(): _root_.scala.Int = {
var __size = 0
if (metadata.isDefined) { __size += _root_.com.google.protobuf.CodedOutputStream.computeStringSize(1, metadata.get) }
entry.foreach(entry => __size += 1 + _root_.com.google.protobuf.CodedOutputStream.computeUInt32SizeNoTag(entry.serializedSize) + entry.serializedSize)
__size
}
final override def serializedSize: _root_.scala.Int = {
var read = __serializedSizeCachedValue
if (read == 0) {
read = __computeSerializedValue()
__serializedSizeCachedValue = read
}
read
}
def writeTo(`_output__`: _root_.com.google.protobuf.CodedOutputStream): Unit = {
metadata.foreach { __v =>
_output__.writeString(1, __v)
};
entry.foreach { __v =>
_output__.writeTag(2, 2)
_output__.writeUInt32NoTag(__v.serializedSize)
__v.writeTo(_output__)
};
}
def mergeFrom(`_input__`: _root_.com.google.protobuf.CodedInputStream): JSONOutput = {
var __metadata = this.metadata
val __entry = (_root_.scala.collection.immutable.Vector.newBuilder[JSONEntry] ++= this.entry)
var _done__ = false
while (!_done__) {
val _tag__ = _input__.readTag()
_tag__ match {
case 0 => _done__ = true
case 10 =>
__metadata = Option(_input__.readString())
case 18 =>
__entry += _root_.scalapb.LiteParser.readMessage(_input__, JSONEntry.defaultInstance)
case tag => _input__.skipField(tag)
}
}
JSONOutput(
metadata = __metadata,
entry = __entry.result()
)
}
def getMetadata: _root_.scala.Predef.String = metadata.getOrElse("")
def clearMetadata: JSONOutput = copy(metadata = None)
def withMetadata(__v: _root_.scala.Predef.String): JSONOutput = copy(metadata = Option(__v))
def clearEntry = copy(entry = _root_.scala.collection.Seq.empty)
def addEntry(__vs: JSONEntry*): JSONOutput = addAllEntry(__vs)
def addAllEntry(__vs: TraversableOnce[JSONEntry]): JSONOutput = copy(entry = entry ++ __vs)
def withEntry(__v: _root_.scala.collection.Seq[JSONEntry]): JSONOutput = copy(entry = __v)
def getFieldByNumber(__fieldNumber: _root_.scala.Int): scala.Any = {
(__fieldNumber: @_root_.scala.unchecked) match {
case 1 => metadata.orNull
case 2 => entry
}
}
def getField(__field: _root_.scalapb.descriptors.FieldDescriptor): _root_.scalapb.descriptors.PValue = {
require(__field.containingMessage eq companion.scalaDescriptor)
(__field.number: @_root_.scala.unchecked) match {
case 1 => metadata.map(_root_.scalapb.descriptors.PString).getOrElse(_root_.scalapb.descriptors.PEmpty)
case 2 => _root_.scalapb.descriptors.PRepeated(entry.map(_.toPMessage)(_root_.scala.collection.breakOut))
}
}
def toProtoString: _root_.scala.Predef.String = _root_.scalapb.TextFormat.printToUnicodeString(this)
def companion = JSONOutput
}
object JSONOutput extends scalapb.GeneratedMessageCompanion[JSONOutput] {
implicit def messageCompanion: scalapb.GeneratedMessageCompanion[JSONOutput] = this
def fromFieldsMap(__fieldsMap: scala.collection.immutable.Map[_root_.com.google.protobuf.Descriptors.FieldDescriptor, scala.Any]): JSONOutput = {
require(__fieldsMap.keys.forall(_.getContainingType() == javaDescriptor), "FieldDescriptor does not match message type.")
val __fields = javaDescriptor.getFields
JSONOutput(
__fieldsMap.get(__fields.get(0)).asInstanceOf[scala.Option[_root_.scala.Predef.String]],
__fieldsMap.getOrElse(__fields.get(1), Nil).asInstanceOf[_root_.scala.collection.Seq[JSONEntry]]
)
}
implicit def messageReads: _root_.scalapb.descriptors.Reads[JSONOutput] = _root_.scalapb.descriptors.Reads{
case _root_.scalapb.descriptors.PMessage(__fieldsMap) =>
require(__fieldsMap.keys.forall(_.containingMessage == scalaDescriptor), "FieldDescriptor does not match message type.")
JSONOutput(
__fieldsMap.get(scalaDescriptor.findFieldByNumber(1).get).flatMap(_.as[scala.Option[_root_.scala.Predef.String]]),
__fieldsMap.get(scalaDescriptor.findFieldByNumber(2).get).map(_.as[_root_.scala.collection.Seq[JSONEntry]]).getOrElse(_root_.scala.collection.Seq.empty)
)
case _ => throw new RuntimeException("Expected PMessage")
}
def javaDescriptor: _root_.com.google.protobuf.Descriptors.Descriptor = DataProto.javaDescriptor.getMessageTypes.get(1)
def scalaDescriptor: _root_.scalapb.descriptors.Descriptor = DataProto.scalaDescriptor.messages(1)
def messageCompanionForFieldNumber(__number: _root_.scala.Int): _root_.scalapb.GeneratedMessageCompanion[_] = {
var __out: _root_.scalapb.GeneratedMessageCompanion[_] = null
(__number: @_root_.scala.unchecked) match {
case 2 => __out = JSONEntry
}
__out
}
lazy val nestedMessagesCompanions: Seq[_root_.scalapb.GeneratedMessageCompanion[_]] = Seq.empty
def enumCompanionForFieldNumber(__fieldNumber: _root_.scala.Int): _root_.scalapb.GeneratedEnumCompanion[_] = throw new MatchError(__fieldNumber)
lazy val defaultInstance = JSONOutput(
)
implicit class JSONOutputLens[UpperPB](_l: _root_.scalapb.lenses.Lens[UpperPB, JSONOutput]) extends _root_.scalapb.lenses.ObjectLens[UpperPB, JSONOutput](_l) {
def metadata: _root_.scalapb.lenses.Lens[UpperPB, _root_.scala.Predef.String] = field(_.getMetadata)((c_, f_) => c_.copy(metadata = Option(f_)))
def optionalMetadata: _root_.scalapb.lenses.Lens[UpperPB, scala.Option[_root_.scala.Predef.String]] = field(_.metadata)((c_, f_) => c_.copy(metadata = f_))
def entry: _root_.scalapb.lenses.Lens[UpperPB, _root_.scala.collection.Seq[JSONEntry]] = field(_.entry)((c_, f_) => c_.copy(entry = f_))
}
final val METADATA_FIELD_NUMBER = 1
final val ENTRY_FIELD_NUMBER = 2
}
object DataProto extends _root_.scalapb.GeneratedFileObject {
lazy val dependencies: Seq[_root_.scalapb.GeneratedFileObject] = Seq(
scalapb.options.ScalapbProto
)
lazy val messagesCompanions: Seq[_root_.scalapb.GeneratedMessageCompanion[_]] = Seq(
JSONEntry,
JSONOutput
)
private lazy val ProtoBytes: Array[Byte] =
scalapb.Encoding.fromBase64(scala.collection.Seq(
"""CgpkYXRhLnByb3RvGhVzY2FsYXBiL3NjYWxhcGIucHJvdG8iLwoJSlNPTkVudHJ5Eg4KAnRzGAEgAigEUgJ0cxISCgRkYXRhG
AIgAigJUgRkYXRhIkoKCkpTT05PdXRwdXQSGgoIbWV0YWRhdGEYASABKAlSCG1ldGFkYXRhEiAKBWVudHJ5GAIgAygLMgouSlNPT
kVudHJ5UgVlbnRyeUIH4j8EEAEoAQ=="""
).mkString)
lazy val scalaDescriptor: _root_.scalapb.descriptors.FileDescriptor = {
val scalaProto = com.google.protobuf.descriptor.FileDescriptorProto.parseFrom(ProtoBytes)
_root_.scalapb.descriptors.FileDescriptor.buildFrom(scalaProto, dependencies.map(_.scalaDescriptor))
}
lazy val javaDescriptor: com.google.protobuf.Descriptors.FileDescriptor = {
val javaProto = com.google.protobuf.DescriptorProtos.FileDescriptorProto.parseFrom(ProtoBytes)
com.google.protobuf.Descriptors.FileDescriptor.buildFrom(javaProto, Array(
scalapb.options.ScalapbProto.javaDescriptor
))
}
@deprecated("Use javaDescriptor instead. In a future version this will refer to scalaDescriptor.", "ScalaPB 0.5.47")
def descriptor: com.google.protobuf.Descriptors.FileDescriptor = javaDescriptor
}
Error
<console>:82: error: type mismatch;
found : JSONEntry.type
required: scalapb.GeneratedMessageCompanion[_]
def companion = JSONEntry
^
I was able to successfully compile your proto file with the following code
project/scalapb.sbt
addSbtPlugin("com.thesamet" % "sbt-protoc" % "0.99.16")
libraryDependencies += "com.thesamet.scalapb" %% "compilerplugin" % "0.7.0"
build.sbt
lazy val root = (project in file(".")).
settings(
inThisBuild(List(
organization := "com.example",
scalaVersion := "2.12.4",
version := "0.1.0-SNAPSHOT"
)),
name := "protobuf",
libraryDependencies ++= Seq(
"com.thesamet.scalapb" %% "scalapb-runtime" % scalapb.compiler.Version.scalapbVersion % "protobuf"
),
PB.targets in (Compile) := Seq(
scalapb.gen() -> (sourceManaged in Compile).value
)
)
Now copy and paste your proto file into src/main/protobuf as hello.proto and do an sbt clean compile.
The only thing I did differently is that I added a package to the proto file:
syntax = "proto2";
import "scalapb/scalapb.proto";
option (scalapb.options) = {
package_name: "com.abhi"
flat_package: true
single_file: true
};
message JSONEntry {
required uint64 ts = 1;
required string data = 2;
}
message JSONOutput {
optional string metadata = 1;
repeated JSONEntry entry = 2;
}
Now finally use the generated code in your app
package example
import com.abhi.JSONEntry
import java.io._
object Hello extends App {
val jsonEntry = JSONEntry(10L, "foo")
val target = new FileOutputStream(new File("foo.bin"))
jsonEntry.writeTo(target)
target.close()
}
The code compiles correctly and there is no compilation error
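To round this out, reading the message back works through the same generated companion. A small sketch reusing the foo.bin file from above (ReadBack is just a hypothetical object name):
package example
import com.abhi.JSONEntry
import java.io._
object ReadBack extends App {
  // Parse the bytes written by Hello back into a JSONEntry.
  val source = new FileInputStream(new File("foo.bin"))
  val restored = JSONEntry.parseFrom(source)
  source.close()
  println(s"${restored.ts} ${restored.data}")
}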

Scala: Save result in a toDF temp table

I'm trying to save some analysis results in a toDF temp table, but I receive the following error: ":215: error: value toDF is not a member of Double".
I'm reading data from a Cassandra table and doing some calculations, and I want to save these results in a temp table.
I'm new to Scala; can somebody help me, please?
My code:
case class Consumo(consumo:Double, consumo_mensal: Double, mes: org.joda.time.DateTime,ano: org.joda.time.DateTime, soma_pf: Double,empo_gasto: Double);
object Analysegridata{
val conf = new SparkConf(true)
.set("spark.cassandra.connection.host","127.0.0.1").setAppName("LiniarRegression")
.set("spark.cassandra.connection.port", "9042")
.set("spark.driver.allowMultipleContexts", "true")
.set("spark.streaming.receiver.writeAheadLog.enable", "true");
val sc = new SparkContext(conf);
val ssc = new StreamingContext(sc, Seconds(1))
val sqlContext = new org.apache.spark.sql.SQLContext(sc);
val checkpointDirectory = "/var/lib/cassandra/data"
ssc.checkpoint(checkpointDirectory) // set checkpoint directory
// val context = StreamingContext.getOrCreate(checkpointDirectory)
import sqlContext.implicits._
JavaSparkContext.fromSparkContext(sc);
def rddconsumo(rddData: Double): Double = {
val rddData: Double = {
implicit val data = conf
val grid = sc.cassandraTable("smartgrids", "analyzer").as((r:Double) => (r)).collect
def goto(cs: Array[Double]): Double = {
var consumo = 0.0;
var totaldias = 0;
var soma_pf = 0.0;
var somamc = 0.0;
var tempo_gasto = 0.0;
var consumo_mensal = 0.0;
var i=0
for (i <- 0 until grid.length) {
val minutos = sc.cassandraTable("smartgrids","analyzer_temp").select("timecol", "MINUTE");
val horas = sc.cassandraTable("smartgrids","analyzer_temp").select("timecol","HOUR_OF_DAY");
val dia = sc.cassandraTable("smartgrids","analyzer_temp").select("timecol", "DAY_OF_MONTH");
val ano = sc.cassandraTable("smartgrids","analyzer_temp").select("timecol", "YEAR");
val mes = sc.cassandraTable("smartgrids","analyzer_temp").select("timecol", "MONTH");
val potencia = sc.cassandraTable("smartgrids","analyzer_temp").select("n_pf1", "n_pf2", "n_pf3")
def convert_minutos (minuto : Int) : Double ={
minuto/60
}
dia.foreach (i => {
def adSum(potencia: Array[Double]) = {
var i=0;
while (i < potencia.length) {
soma_pf += potencia(i);
i += 1;
soma_pf;
println("Potemcia =" + soma_pf)
}
}
def tempo(minutos: Array[Int]) = {
var i=0;
while (i < minutos.length) {
somamc += convert_minutos(minutos(i))
i += 1;
somamc
}
}
def tempogasto(horas: Array[Int]) = {
var i=0;
while (i < horas.length) {
tempo_gasto = horas(i) + somamc;
i += 1;
tempo_gasto;
println("Temo que o aparelho esteve ligado =" + tempo_gasto)
}
}
def consumof(dia: Array[Int]) = {
var i=0;
while (i < dia.length) {
consumo = soma_pf * tempo_gasto;
i += 1;
consumo;
println("Consumo diario =" + consumo)
}
}
})
mes.foreach (i => {
def totaltempo(dia: Array[Int]) = {
var i = 0;
while(i < dia.length){
totaldias += dia(i);
i += 1;
totaldias;
println("Numero total de dias =" + totaldias)
}
}
def consumomensal(mes: Array[Int]) = {
var i = 0;
while(i < mes.length){
consumo_mensal = consumo * totaldias;
i += 1;
consumo_mensal;
println("Consumo Mensal =" + consumo_mensal);
}
}
})
}
consumo;
totaldias;
consumo_mensal;
soma_pf;
tempo_gasto;
somamc
}
rddData
}
rddData.toDF().registerTempTable("rddData")
}
ssc.start()
ssc.awaitTermination()
error: value toDF is not a member of Double
It's rather unclear what you're trying to do exactly (too much code, try providing a minimal example), but there are a few apparent issues:
rddData has type Double: seems like it should be RDD[Double] (which is a distributed collection of Double values). Trying to save a single Double value as a table doesn't make much sense, and indeed - doesn't work (toDF can be called on an RDD, not any type, specifically not on Double, as the compiler warns).
you collect() the data: if you want to load an RDD, transform it using some manipulation, and then save it as a table - collect() should probably not be called on that RDD. collect() sends all the data (distributed across the cluster) into the single "driver" machine (the one running this code) - after which you're not taking advantage of the cluster, and again not using the RDD data structure so you can't convert the data into a DataFrame using toDF.
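To make that concrete, here is a minimal sketch of the pattern described above, with assumed names (the Resultado case class and the calculation are placeholders; only the keyspace, table, and n_pf1 column come from the question): keep the data as an RDD, transform it with RDD operations instead of collect(), and only then call toDF.
import com.datastax.spark.connector._
import sqlContext.implicits._
// Placeholder result row; field names are assumptions, not the real schema.
case class Resultado(soma_pf: Double, consumo: Double)
// Stay distributed: read the Cassandra table as an RDD and transform it
// with RDD operations instead of collect()-ing everything to the driver.
val resultados = sc.cassandraTable("smartgrids", "analyzer")
  .map(row => row.getDouble("n_pf1"))
  .map(pf => Resultado(pf, pf * 24.0))
// toDF works on an RDD of case-class rows, not on a single Double.
resultados.toDF().registerTempTable("rddData")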

Inconsistent outputs from Scala Spark and pyspark job

I am converting my Scala code to pyspark like below, but got different counts for the final RDD.
My Scala code:
val scalaRDD = rowRDD.map {
row: Row =>
var rowList: ListBuffer[Row] = ListBuffer()
rowList.add(row)
(row.getString(1) + "_" + row.getString(6), rowList)
}.reduceByKey{ (list1,list2) =>
var rowList: ListBuffer[Row] = ListBuffer()
for (i <- 0 to list1.length -1) {
val row1 = list1.get(i);
var foundMatch = false;
breakable {
for (j <- 0 to list2.length -1) {
var row2 = list2.get(j);
val result = mergeRow(row1, row2)
if (result._1) {
list2.set(j, result._2)
foundMatch = true;
break;
}
} // for j loop
} // breakable for j
if(!foundMatch) {
rowList.add(row1);
}
}
list2.addAll(rowList);
list2
}.flatMap { t=> t._2 }
where
def mergeRow(row1:Row, row2:Row):(Boolean, Row)= {
var z:Array[String] = new Array[String](row1.length)
var hasDiff = false
for (k <- 1 to row1.length -2){
// k = 0 : ID, always different
// k = 43 : last field, which is not important
if (row1.getString(0) < row2.getString(0)) {
z(0) = row2.getString(0)
z(43) = row2.getString(43)
} else {
z(0) = row1.getString(0)
z(43) = row1.getString(43)
}
if (Option(row2.getString(k)).getOrElse("").isEmpty && !Option(row1.getString(k)).getOrElse("").isEmpty) {
z(k) = row1.getString(k)
hasDiff = true
} else if (!Option(row1.getString(k)).getOrElse("").isEmpty && !Option(row2.getString(k)).getOrElse("").isEmpty && row1.getString(k) != row2.getString(k)) {
return (false, null)
} else {
z(k) = row2.getString(k)
}
} // for k loop
if (hasDiff) {
(true, Row.fromSeq(z))
} else {
(true, row2)
}
}
I then tried to convert them to pyspark code as below:
pySparkRDD = rowRDD.map (
lambda row : singleRowList(row)
).reduceByKey(lambda list1,list2: mergeList(list1,list2)).flatMap(lambda x : x[1])
where I have:
def mergeRow(row1, row2):
z=[]
hasDiff = False
#for (k <- 1 to row1.length -2){
for k in xrange(1, len(row1) - 2):
# k = 0 : ID, always different
# k = 43 : last field, which is not important
if (row1[0] < row2[0]):
z[0] = row2[0]
z[43] = row2[43]
else:
z[0] = row1[0]
z[43] = row1[43]
if not(row2[k]) and row1[k]:
z[k] = row1[k].strip()
hasDiff = True
elif row1[k] and row2[k] and row1[k].strip() != row2[k].strip():
return (False, None)
else:
z[k] = row2[k].strip()
if hasDiff:
return (True, Row.fromSeq(z))
else:
return (True, row2)
and
def singleRowList(row):
myList=[]
myList.append(row)
return (row[1] + "_" + row[6], myList)
and
def mergeList(list1, list2):
rowList = []
for i in xrange(0, len(list1)-1):
row1 = list1[i]
foundMatch = False
for j in xrange(0, len(list2)-1):
row2 = list2[j]
resultBool, resultRow = mergeRow(row1, row2)
if resultBool:
list2[j] = resultRow
foundMatch = True
break
if foundMatch == False:
rowList.append(row1)
list2.extend(rowList)
return list2
BTW, rowRDD is converted from a data frame. i.e. rowRDD = myDF.rdd
However, I got different counts for scalaRDD and pySparkRDD. I checked the code many times but couldn't figure out what I missed. Does anyone have any ideas? Thanks!
Consider this:
scala> (1 to 5).length
res1: Int = 5
and this:
>>> len(xrange(1, 5))
4
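In other words, Scala's 0 to n - 1 is inclusive while Python's xrange(0, n - 1) stops one index earlier, so mergeList skips the last element of each list (and mergeRow's xrange(1, len(row1) - 2) skips one field compared to 1 to row1.length - 2). A sketch of mergeList with corrected bounds; mergeRow needs the analogous change to xrange(1, len(row1) - 1):
def mergeList(list1, list2):
    rowList = []
    # range over every index, matching Scala's `0 to list1.length - 1`
    for i in xrange(0, len(list1)):
        row1 = list1[i]
        foundMatch = False
        for j in xrange(0, len(list2)):
            row2 = list2[j]
            resultBool, resultRow = mergeRow(row1, row2)
            if resultBool:
                list2[j] = resultRow
                foundMatch = True
                break
        if not foundMatch:
            rowList.append(row1)
    list2.extend(rowList)
    return list2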

scala priority queue not ordering properly?

I'm seeing some strange behavior with Scala's collection.mutable.PriorityQueue. I'm performing an external sort and testing it with 1M records. Each time I run the test and verify the results, between 10 and 20 records are not sorted properly. If I replace the Scala PriorityQueue implementation with a java.util.PriorityQueue, it works 100% of the time. Any ideas?
Here's the code (sorry it's a bit long...). I test it using the tools gensort -a 1000000 and valsort from http://sortbenchmark.org/
def externalSort(inFileName: String, outFileName: String)
(implicit ord: Ordering[String]): Int = {
val MaxTempFiles = 1024
val TempBufferSize = 4096
val inFile = new java.io.File(inFileName)
/** Partitions input file and sorts each partition */
def partitionAndSort()(implicit ord: Ordering[String]):
List[java.io.File] = {
/** Gets block size to use */
def getBlockSize: Long = {
var blockSize = inFile.length / MaxTempFiles
val freeMem = Runtime.getRuntime().freeMemory()
if (blockSize < freeMem / 2)
blockSize = freeMem / 2
else if (blockSize >= freeMem)
System.err.println("Not enough free memory to use external sort.")
blockSize
}
/** Sorts and writes data to temp files */
def writeSorted(buf: List[String]): java.io.File = {
// Create new temp buffer
val tmp = java.io.File.createTempFile("external", "sort")
tmp.deleteOnExit()
// Sort buffer and write it out to tmp file
val out = new java.io.PrintWriter(tmp)
try {
for (l <- buf.sorted) {
out.println(l)
}
} finally {
out.close()
}
tmp
}
val blockSize = getBlockSize
var tmpFiles = List[java.io.File]()
var buf = List[String]()
var currentSize = 0
// Read input and divide into blocks
for (line <- io.Source.fromFile(inFile).getLines()) {
if (currentSize > blockSize) {
tmpFiles ::= writeSorted(buf)
buf = List[String]()
currentSize = 0
}
buf ::= line
currentSize += line.length() * 2 // 2 bytes per char
}
if (currentSize > 0) tmpFiles ::= writeSorted(buf)
tmpFiles
}
/** Merges results of sorted partitions into one output file */
def mergeSortedFiles(fs: List[java.io.File])
(implicit ord: Ordering[String]): Int = {
/** Temp file buffer for reading lines */
class TempFileBuffer(val file: java.io.File) {
private val in = new java.io.BufferedReader(
new java.io.FileReader(file), TempBufferSize)
private var curLine: String = ""
readNextLine() // prep first value
def currentLine = curLine
def isEmpty = curLine == null
def readNextLine() {
if (curLine == null) return
try {
curLine = in.readLine()
} catch {
case _: java.io.EOFException => curLine = null
}
if (curLine == null) in.close()
}
override protected def finalize() {
try {
in.close()
} finally {
super.finalize()
}
}
}
val wrappedOrd = new Ordering[TempFileBuffer] {
def compare(o1: TempFileBuffer, o2: TempFileBuffer): Int = {
ord.compare(o1.currentLine, o2.currentLine)
}
}
val pq = new collection.mutable.PriorityQueue[TempFileBuffer](
)(wrappedOrd)
// Init queue with item from each file
for (tmp <- fs) {
val buf = new TempFileBuffer(tmp)
if (!buf.isEmpty) pq += buf
}
var count = 0
val out = new java.io.PrintWriter(new java.io.File(outFileName))
try {
// Read each value off of queue
while (pq.size > 0) {
val buf = pq.dequeue()
out.println(buf.currentLine)
count += 1
buf.readNextLine()
if (buf.isEmpty) {
buf.file.delete() // don't need anymore
} else {
// re-add to priority queue so we can process next line
pq += buf
}
}
} finally {
out.close()
}
count
}
mergeSortedFiles(partitionAndSort())
}
My tests don't show any bugs in PriorityQueue.
import org.scalacheck._
import Prop._
object PriorityQueueProperties extends Properties("PriorityQueue") {
def listToPQ(l: List[String]): PriorityQueue[String] = {
val pq = new PriorityQueue[String]
l foreach (pq +=)
pq
}
def pqToList(pq: PriorityQueue[String]): List[String] =
if (pq.isEmpty) Nil
else { val h = pq.dequeue; h :: pqToList(pq) }
property("Enqueued elements are dequeued in reverse order") =
forAll { (l: List[String]) => l.sorted == pqToList(listToPQ(l)).reverse }
property("Adding/removing elements doesn't break sorting") =
forAll { (l: List[String], s: String) =>
(l.size > 0) ==>
((s :: l.sorted.init).sorted == {
val pq = listToPQ(l)
pq.dequeue
pq += s
pqToList(pq).reverse
})
}
}
scala> PriorityQueueProperties.check
+ PriorityQueue.Enqueued elements are dequeued in reverse order: OK, passed
100 tests.
+ PriorityQueue.Adding/removing elements doesn't break sorting: OK, passed
100 tests.
If you could somehow reduce the input enough to make a test case, it would help.
I ran it with five million inputs several times, and the output always matched what was expected. My guess from looking at your code is that your Ordering is the problem (i.e., it's giving inconsistent answers).
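If you want to test that suspicion directly, here is a small sketch (the sample source and size are arbitrary assumptions) that probes the Ordering[String] you pass in for antisymmetry and transitivity on real records; an Ordering violating either property can silently corrupt a heap:
def checkOrdering(records: Seq[String])(implicit ord: Ordering[String]): Unit = {
  for (a <- records; b <- records) {
    // antisymmetry: compare(a, b) and compare(b, a) must have opposite signs
    assert(math.signum(ord.compare(a, b)) == -math.signum(ord.compare(b, a)),
      s"inconsistent compare: '$a' vs '$b'")
  }
  for (a <- records; b <- records; c <- records) {
    // transitivity: a <= b and b <= c must imply a <= c
    if (ord.lteq(a, b) && ord.lteq(b, c))
      assert(ord.lteq(a, c), s"non-transitive compare: '$a', '$b', '$c'")
  }
}
// e.g. on a small sample of input lines:
// checkOrdering(io.Source.fromFile("input.txt").getLines().take(200).toList)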