Saving data to a sequence file - Scala

I'm trying to do some filtering on a sequence file and save the result back to another sequence file, for example:
val subset = ???
val hc = sc.hadoopConfiguration
val serializers = List(
  classOf[WritableSerialization].getName,
  classOf[ResultSerialization].getName
).mkString(",")
hc.set("io.serializations", serializers)
subset.saveAsNewAPIHadoopFile(
  "output/sequence",
  classOf[ImmutableBytesWritable],
  classOf[Result],
  classOf[SequenceFileOutputFormat[ImmutableBytesWritable, Result]],
  hc
)
After compilation I receive the following error:
found   : Class[org.apache.hadoop.mapred.SequenceFileOutputFormat[org.apache.hadoop.hbase.io.ImmutableBytesWritable,org.apache.hadoop.hbase.client.Result]](classOf[org.apache.hadoop.mapred.SequenceFileOutputFormat])
required: Class[_ <: org.apache.hadoop.mapreduce.OutputFormat[_, _]]
  classOf[SequenceFileOutputFormat[ImmutableBytesWritable, Result]],
To my knowledge, SequenceFileOutputFormat extends FileOutputFormat, which extends OutputFormat, so I must be missing something.
Can you please help?
I raised an issue with the Spark team at https://issues.apache.org/jira/browse/SPARK-25405
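Judging purely by the error text, the classOf[...] argument resolves to the old-API class org.apache.hadoop.mapred.SequenceFileOutputFormat, while saveAsNewAPIHadoopFile expects an org.apache.hadoop.mapreduce.OutputFormat. A minimal sketch of the call with the new-API class imported instead, assuming the rest of the setup stays as above (an inferred fix, not a confirmed resolution of the linked issue):
// use the new-API output format, not org.apache.hadoop.mapred.SequenceFileOutputFormat
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat

subset.saveAsNewAPIHadoopFile(
  "output/sequence",
  classOf[ImmutableBytesWritable],
  classOf[Result],
  classOf[SequenceFileOutputFormat[ImmutableBytesWritable, Result]],
  hc
)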

Related

List files in directory (including file information) with Scala/Spark

I'm pretty new to Scala/Spark and I hope you guys can help me. I want to get the files which were created after a certain timestamp in an HDFS directory, for a little monitoring in Zeppelin. Therefore I need a column with the file name, the file size and the modification date.
I found that this works for me to get all the information I need:
val fs = FileSystem.get(new Configuration())
val dir: String = "some/hdfs/path"
val input_files = fs.listStatus(new Path(dir)).filter(_.getModificationTime > timeInEpoch)
From that result I would like to create a DataFrame in Spark with one row per file containing its information (or at least the information mentioned above):
val data = sc.parallelize(input_files)
val dfFromData2 = spark.createDataFrame(data).toDF()
If I try it this way I get the following response:
309: error: overloaded method value createDataFrame with alternatives:
[A <: Product](data: Seq[A])(implicit evidence$3: reflect.runtime.universe.TypeTag[A])org.apache.spark.sql.DataFrame <and>
[A <: Product](rdd: org.apache.spark.rdd.RDD[A])(implicit evidence$2: reflect.runtime.universe.TypeTag[A])org.apache.spark.sql.DataFrame
cannot be applied to (org.apache.spark.rdd.RDD[org.apache.hadoop.fs.FileStatus])
val dfFromData2 = spark.createDataFrame(data).toDF()
I hope you can help me out :)
Greetings
As the error message indicates, the Hadoop FileStatus type is not a subtype of Product, i.e. a tuple or case class. Spark DataFrames have their own SQL-style type system, which doesn't allow for arbitrary, complex types like FileStatus. Likewise, if you were to attempt an operation on the RDD you created, you would receive a similar error, as FileStatus is not serializable. Your best bet is to extract the data you need as a tuple or case class and create a DataFrame from that:
case class FileInfo(name: String, modifiedTime: Long, size: Long)

import spark.implicits._ // needed for .toDF() on a Seq of case classes

val df = input_files.map { x =>
  FileInfo(x.getPath.toString, x.getModificationTime, x.getLen)
}.toSeq.toDF()
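For what it's worth, the resulting DataFrame can then be inspected like any other; a quick sketch, assuming the SparkSession is named spark and spark.implicits._ is in scope as above:
df.printSchema() // name, modifiedTime, size
df.orderBy($"modifiedTime".desc).show(truncate = false) // newest files first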

ScalaCache with Redis support

I am trying to integrate Redis with ScalaCache. Keys are usually strings, but values can be objects, Set[String], etc. The cache is initialized like this:
val cache: RedisCache = RedisCache(config.host, config.port)
private implicit val scalaCache: ScalaCache[Array[Byte]] = ScalaCache(cacheService.cache)
But while calling put, I am getting this error: "Could not find any Codecs for type Set[String] and Repr". It looks like I need to provide a codec for my cache input, as suggested here, so I added:
class A extends Codec[Set[String], Array[Byte]] with GZippingBinaryCodec[Set[String]]
Even after adding my class A, I am still getting the same error. What am I missing?
As you mentioned in the link, you can either serialize values in a binary format:
import scalacache.serialization.binary._
or as JSON using circe:
import scalacache.serialization.circe._
import io.circe.generic.auto._
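For illustration, with a recent ScalaCache release the binary import alone usually provides a codec for Set[String]; a rough sketch assuming the newer Cache[V]-style API (package and mode names vary between versions, so treat this as an outline rather than exact code):
import scalacache._
import scalacache.redis._
import scalacache.serialization.binary._
import scalacache.modes.sync._

implicit val setCache: Cache[Set[String]] = RedisCache(config.host, config.port)

put("user-ids")(Set("a", "b", "c"), ttl = None) // codec for Set[String] comes from the binary import
val ids: Option[Set[String]] = get("user-ids")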
It looks like it's solved in the next release by binary and circe serialization. I am on version 10 and solved it with the following:
implicit object SetBindaryCodec extends Codec[Any, Array[Byte]] {
  // plain Java serialization of the value into a byte array
  override def serialize(value: Any): Array[Byte] = {
    val stream: ByteArrayOutputStream = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(stream)
    oos.writeObject(value)
    oos.close()
    stream.toByteArray
  }
  // ...and the matching Java deserialization back to the original value
  override def deserialize(data: Array[Byte]): Any = {
    val ois = new ObjectInputStream(new ByteArrayInputStream(data))
    val value = ois.readObject
    ois.close()
    value
  }
}
Perks of being up to date. Will upgrade the version, posted it just in case somebody needs it.
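If it helps anyone, the codec above can be sanity-checked on its own, independently of which ScalaCache version is wired to Redis (a small hypothetical round-trip test, not part of the original answer):
val bytes = SetBindaryCodec.serialize(Set("a", "b", "c"))
val back = SetBindaryCodec.deserialize(bytes).asInstanceOf[Set[String]]
assert(back == Set("a", "b", "c")) // plain Java serialization round-trips the Set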

Using datetime/timestamp in Scala Slick

Is there an easy way to use datetime/timestamp in Scala? What's best practice? I currently use "date" to persist data, but I'd also like to persist the current time.
I'm struggling to set the date. This is my code:
val now = new java.sql.Timestamp(new java.util.Date().getTime)
I also tried to do this:
val now = new java.sql.Date(new java.util.Date().getTime)
When changing the datatype in my evolutions to "timestamp", I got an error. This is the model:
case class MyObjectModel(
  id: Option[Int],
  title: String,
  createdat: Timestamp,
  updatedat: Timestamp,
  ...)

object MyObjectModel {
  implicit val myObjectFormat = Json.format[MyObjectModel]
}
Console:
app\models\MyObjectModel.scala:31: No implicit format for
java.sql.Timestamp available.
[error] implicit val myObjectFormat = Json.format[MyObjectModel]
[error] ^
[error] one error found
[error] (compile:compileIncremental) Compilation failed
Update:
object ProcessStepTemplatesModel {
  implicit lazy val timestampFormat: Format[Timestamp] = new Format[Timestamp] {
    override def reads(json: JsValue): JsResult[Timestamp] = json.validate[Long].map(l => Timestamp.from(Instant.ofEpochMilli(l)))
    override def writes(o: Timestamp): JsValue = JsNumber(o.getTime)
  }
  implicit val processStepFormat = Json.format[ProcessStepTemplatesModel]
}
Try using this in your code:
implicit object timestampFormat extends Format[Timestamp] {
  val format = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SS'Z'")
  def reads(json: JsValue) = {
    val str = json.as[String]
    JsSuccess(new Timestamp(format.parse(str).getTime))
  }
  def writes(ts: Timestamp) = JsString(format.format(ts))
}
It is (de)serialized in a JavaScript-compatible format like the following: "2018-01-06T18:31:29.436Z".
Please note: the implicit object must be declared in the code before it is used.
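In other words, the Format has to be in scope before the Json.format macro is expanded; a small sketch reusing the question's MyObjectModel (the names are the question's, the ordering is the point):
object MyObjectModel {
  // declared first, so the macro below can pick it up
  implicit object timestampFormat extends Format[Timestamp] {
    val format = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SS'Z'")
    def reads(json: JsValue) = JsSuccess(new Timestamp(format.parse(json.as[String]).getTime))
    def writes(ts: Timestamp) = JsString(format.format(ts))
  }
  implicit val myObjectFormat = Json.format[MyObjectModel]
}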
I guess your question is handled in What's the standard way to work with dates and times in Scala? Should I use Java types or there are native Scala alternatives?.
Go with Java 8 "java.time".
In the subject you mention Slick (a Scala database library), but the error you got comes from a JSON library, and it says that you don't have a converter from java.sql.Timestamp to JSON. Without knowing which JSON library you are using, it's hard to help you with a working example.
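Since the error mentions Json.format, the JSON library is presumably Play JSON. If the model is switched to java.time as suggested, recent Play JSON versions already ship a Format for Instant, so no hand-written converter is needed; a hedged sketch, assuming Play JSON 2.6+ and a Slick version that maps java.time columns:
import java.time.Instant
import play.api.libs.json.Json

case class MyObjectModel(
  id: Option[Int],
  title: String,
  createdat: Instant,
  updatedat: Instant)

object MyObjectModel {
  // Play JSON provides implicit Reads/Writes for java.time.Instant out of the box
  implicit val myObjectFormat = Json.format[MyObjectModel]
}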

Enriching SparkContext without incurring serialization issues

I am trying to use Spark to process data that comes from HBase tables. This blog post gives an example of how to use NewHadoopAPI to read data from any Hadoop InputFormat.
What I have done
Since I will need to do this many times, I was trying to use implicits to enrich SparkContext, so that I can get an RDD from a given set of columns in HBase. I have written the following helper:
trait HBaseReadSupport {
  implicit def toHBaseSC(sc: SparkContext) = new HBaseSC(sc)
  implicit def bytes2string(bytes: Array[Byte]) = new String(bytes)
}

final class HBaseSC(sc: SparkContext) extends Serializable {
  def extract[A](data: Map[String, List[String]], result: Result, interpret: Array[Byte] => A) =
    data map { case (cf, columns) =>
      val content = columns map { column =>
        val cell = result.getColumnLatestCell(cf.getBytes, column.getBytes)
        column -> interpret(CellUtil.cloneValue(cell))
      } toMap

      cf -> content
    }

  def makeConf(table: String) = {
    val conf = HBaseConfiguration.create()
    conf.setBoolean("hbase.cluster.distributed", true)
    conf.setInt("hbase.client.scanner.caching", 10000)
    conf.set(TableInputFormat.INPUT_TABLE, table)
    conf
  }

  def hbase[A](table: String, data: Map[String, List[String]])
    (interpret: Array[Byte] => A) =
    sc.newAPIHadoopRDD(makeConf(table), classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result]) map { case (key, row) =>
      Bytes.toString(key.get) -> extract(data, row, interpret)
    }
}
It can be used like this:
val rdd = sc.hbase[String](table, Map(
  "cf" -> List("col1", "col2")
))
In this case we get an RDD of (String, Map[String, Map[String, String]]), where the first component is the rowkey and the second is a map whose keys are column families and whose values are maps whose keys are columns and whose values are the cell contents.
Where it fails
Unfortunately, it seems that my job gets a reference to sc, which is itself not serializable by design. What I get when I run the job is:
Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task not serializable: java.io.NotSerializableException: org.apache.spark.SparkContext
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
I can remove the helper classes and use the same logic inline in my job and everything runs fine. But I want to get something which I can reuse instead of writing the same boilerplate over and over.
By the way, the issue is not specific to implicits; even using a plain function of sc exhibits the same problem.
For comparison, the following helper to read TSV files (I know it's broken as it does not support quoting and so on, never mind) seems to work fine:
trait TsvReadSupport {
  implicit def toTsvRDD(sc: SparkContext) = new TsvRDD(sc)
}

final class TsvRDD(val sc: SparkContext) extends Serializable {
  def tsv(path: String, fields: Seq[String], separator: Char = '\t') = sc.textFile(path) map { line =>
    val contents = line.split(separator).toList
    (fields, contents).zipped.toMap
  }
}
How can I encapsulate the logic to read rows from HBase without unintentionally capturing the SparkContext?
Just add the @transient annotation to the sc variable:
final class HBaseSC(@transient val sc: SparkContext) extends Serializable {
  ...
}
and make sure sc is not used within the extract function, since it won't be available on the workers.
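With that change in place, the usage from the question should work unchanged, since the closure shipped to the executors no longer drags the SparkContext along; for example:
val rdd = sc.hbase[String](table, Map(
  "cf" -> List("col1", "col2")
))
rdd.take(5).foreach(println) // should no longer hit the NotSerializableException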
If it's necessary to access the Spark context from within a distributed computation, the rdd.context method can be used:
val rdd = sc.newAPIHadoopRDD(...)
rdd map {
  case (k, v) =>
    val ctx = rdd.context
    ....
}

How to convert, group and sort java.util.List[java.util.Map[String, Object]]?

I convert this:
import scala.collection.JavaConverters._
val list:java.util.List[java.util.Map[String, Object]] = new java.util.ArrayList[java.util.Map[String, Object]]()
val map1:java.util.Map[String, AnyRef] = new java.util.HashMap[String,AnyRef]()
map1.put("payout", 3.asInstanceOf[AnyRef])
list.add(map1)
val map2:java.util.Map[String, AnyRef] = new java.util.HashMap[String, AnyRef]()
map2.put("payout", 2.asInstanceOf[AnyRef])
list.add(map2)
val map3:java.util.Map[String, AnyRef] = new java.util.HashMap[String, AnyRef]()
map3.put("payout", 2.asInstanceOf[AnyRef])
list.add(map3)
val map4:java.util.Map[String, AnyRef] = new java.util.HashMap[String, AnyRef]()
map4.put("payout", 1.asInstanceOf[AnyRef])
list.add(map4)
println(list)
val result = list.asScala
//result Buffer({payout=3}, {payout=2}, {payout=2}, {payout=1})
And I wish that
list.asScala.groupBy(_("payout")).toList
would preserve its ordering (sorted by payout),
but .toList.sortBy(_._1) throws an error:
error: No implicit Ordering defined for java.lang.Object.
val result = list.groupBy(_("payout")).toList.sortBy(_._1)
This gives a result, but I don't know if it's what you wanted:
val result = list.asScala.map(_.asScala).groupBy(_("payout")).toList.sortWith(_._1.asInstanceOf[Int] > _._1.asInstanceOf[Int])
I added map(_.asScala) in order to convert your Java maps to Scala maps. The groupBy key is a java.lang.Object, which does not have an Ordering; using sortWith(_._1.asInstanceOf[Int] > _._1.asInstanceOf[Int]) I cast it to Int in order to sort it. This will of course crash if some other type of object is used, but there is no way to order an object that you don't know anything about.
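A hypothetical variation on the same idea: cast the payout to Int up front, so the groupBy keys get a proper Ordering and sortBy can be used without casts inside the comparator:
val result = list.asScala
  .map(_.asScala.toMap)
  .groupBy(_("payout").asInstanceOf[Int])
  .toList
  .sortBy(_._1)(Ordering[Int].reverse) // highest payout first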
I added map(.asScala) in order to convert your java maps to scala maps. The group by value is a java.lang.Object which does not have an ordering; using sortWith(._1.asInstanceOf[Int] > _._1.asInstanceOf[Int]) I cast it to Int in order to sort it. This will of course crash if some other object is used, but there is no way to order an object that you don't know anything about.