csv to avro without apache spark in scala

Is there a way I can convert a CSV file to Avro without using Apache Spark? Most of the posts I see suggest using Spark, which I cannot do in my case. I have a schema in a separate file. I was thinking of writing a custom serializer and deserializer that uses the schema to convert CSV to Avro. Any kind of reference would work for me.
Thanks

If you only have strings and primitives, you could put together a crude implementation like this fairly easily:
import scala.io.Source
import scala.collection.JavaConverters._
import org.apache.avro.Schema
import org.apache.avro.Schema.Type._
import org.apache.avro.generic.GenericData

def csvToAvro(file: String, schema: Schema) = {
  val rec = new GenericData.Record(schema)
  val types = schema
    .getFields
    .asScala
    .map { f => f.pos -> f.schema.getType }

  Source.fromFile(file)
    .getLines
    .map(_.split(",").toSeq)
    .foreach { data =>
      (data zip types)
        .foreach {
          case (str, (idx, STRING))  => rec.put(idx, str)
          case (str, (idx, INT))     => rec.put(idx, str.toInt)
          case (str, (idx, LONG))    => rec.put(idx, str.toLong)
          case (str, (idx, FLOAT))   => rec.put(idx, str.toFloat)
          case (str, (idx, DOUBLE))  => rec.put(idx, str.toDouble)
          case (str, (idx, BOOLEAN)) => rec.put(idx, str.toBoolean)
          case (str, (idx, unknown)) =>
            throw new IllegalArgumentException(s"Don't know how to convert $str to $unknown at $idx")
        }
    }

  rec
}
Note this does not handle nullable fields: for those the type is going to be UNION, and you'll have to look inside the schema to find out the actual data type.
Also, the "csv parsing" here is very crude (just splitting at the comma isn't really a good idea, because it will break if a string field happens to contain a comma in the data, or if fields are escaped with double quotes).
And also, you'll probably want to add some sanity-checking to make sure, for example, that the number of fields in the csv line matches the number of fields in the schema etc.
But the above considerations notwithstanding, this should be sufficient to illustrate the approach and get you started.
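For completeness, once the records are built, writing them out as an .avro container file is a matter of the standard Avro Java API. A minimal sketch, assuming avro is on the classpath; the writeAvro helper and the output path are made up for illustration:
import java.io.File
import org.apache.avro.Schema
import org.apache.avro.file.DataFileWriter
import org.apache.avro.generic.{GenericDatumWriter, GenericRecord}

// Hypothetical helper: persist already-built records to "output.avro"
// (placeholder path) using DataFileWriter from the Avro Java library.
def writeAvro(records: Seq[GenericRecord], schema: Schema): Unit = {
  val writer = new DataFileWriter(new GenericDatumWriter[GenericRecord](schema))
  writer.create(schema, new File("output.avro"))
  try records.foreach(writer.append)
  finally writer.close()
}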

Avro is an open format; there are many languages which support it.
Just pick one, like Python, for example, which also supports CSV. But Go would do, and Java as well.

Related

read a file in scala and get key value pairs as Map[String, List[String]]

I am reading a file and getting the records as a Map[String, List[String]] in Spark/Scala. I want to achieve the same thing in pure Scala, without any Spark references (not reading an RDD). What should I change to make it work in a pure Scala way?
rdd
  .filter(x => (x != null) && (x.length > 0))
  .zipWithIndex()
  .map { case (line, index) =>
    val array = line.split("~").map(_.trim)
    (array(0), array(1), index)
  }
  .groupBy(_._1)
  .mapValues(x => x.toList.sortBy(_._3).map(_._2))
  .collect
  .toMap
For the most part it will remain the same, except for the groupBy part on the RDD. Scala's List also has the map, filter, reduce, etc. methods, so they can be used in much the same fashion.
val lines = Source.fromFile("filename.txt").getLines.toList
Once the file is read and stored in List, the methods can be applied to it.
For the groupBy part, one simple approach is to sort the tuples on the key. That will effectively cluster the tuples with the same key together.
val grouped = scala.util.Sorting.stableSort(arr,
  (e1: (String, String, Int), e2: (String, String, Int)) => e1._1 < e2._1)
There are definitely better solutions, but this would effectively do the same task.
I came up with the below approach
Source.fromInputStream(getClass.getResourceAsStream(filePath))
  .getLines
  .filter(line => (line != null) && (line.length > 0))
  .map(_.split("~"))
  .toList
  .groupBy(_(0))
  .map { case (key, values) => (key, values.map(_(1))) }
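For reference, here is a sketch of a pure-Scala version that mirrors the original Spark snippet more closely, preserving the line order via zipWithIndex (the file name and the "~" delimiter are taken from the question):
import scala.io.Source

// Keep the original line order by zipping with an index, then group by key
// and sort each group by that index before dropping it.
val result: Map[String, List[String]] =
  Source.fromFile("filename.txt")
    .getLines
    .filter(line => line != null && line.nonEmpty)
    .zipWithIndex
    .map { case (line, index) =>
      val array = line.split("~").map(_.trim)
      (array(0), array(1), index)
    }
    .toList
    .groupBy(_._1)
    .map { case (key, values) => key -> values.sortBy(_._3).map(_._2) }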

Are Play form JSON parsers filtering out empty arrays incorrectly?

In a PUT request from a client device, I would like the application/json body to look like {"content": []} in order to clear a list I'm maintaining for the client. However, in a Play Form, such a key-value pair, where the value is an empty array, disappears after form parsing.
You can see that in regular (non-Form) Play JSON parsing, the empty array is maintained as expected.
[info] Starting scala interpreter...
Welcome to Scala 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_212).
Type in expressions for evaluation. Or try :help.
scala> import play.api.libs.json._
import play.api.libs.json._
scala> val json = Json.parse("""{"content": []}""")
json: play.api.libs.json.JsValue = {"content":[]}
However, in Play's form parsing, the behavior is different.
private[data] object FormUtils {
  def fromJson(prefix: String = "", js: JsValue): Map[String, String] = js match {
    case JsObject(fields) =>
      val prefix2 = Option(prefix).filterNot(_.isEmpty).map(_ + ".").getOrElse("")
      fields.iterator
        .map { case (key, value) => fromJson(prefix2 + key, value) }
        .foldLeft(Map.empty[String, String])(_ ++ _)
    case JsArray(values) =>
      values.zipWithIndex.iterator
        .map { case (value, i) => fromJson(s"$prefix[$i]", value) }
        .foldLeft(Map.empty[String, String])(_ ++ _)
    case JsNull           => Map.empty
    case JsUndefined()    => Map.empty
    case JsBoolean(value) => Map(prefix -> value.toString)
    case JsNumber(value)  => Map(prefix -> value.toString)
    case JsString(value)  => Map(prefix -> value.toString)
  }
}
https://github.com/playframework/playframework/blob/826c76ee967d8ec35b76b9a8b594bfaa676a9510/core/play/src/main/scala/play/api/data/Form.scala#L397
The general idea of the function is a recursive traversal of the incoming JsValue objects and arrays, where null, undefined, booleans, numbers, strings, and empty objects and arrays serve as base cases.
In the JsArray case, if values is empty then it returns Map.empty[String, String]. As a result, both [] and {"content": []} will be parsed to Map.empty[String, String]. For a form, the [] case seems reasonable, but {"content": []} is a legitimate key-value pair that should be kept in the form. Note that the same issue occurs with an empty JsObject.
Does this seem like a reasonable issue? If so, any workarounds that you can think of? I understand that it could be difficult for Play to change this since users of the framework might already be depending on this behavior.
I think that is correct behavior, since as far as forms are concerned there is no concept of an Array.
A form simply consists of a set of key-value-pairs and an array is just flattened to fit into this scheme, a la path.to.array[0] = value.
The form, however, can then be mapped to a model which actually has a sequence of values.
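If keeping the empty array matters for this endpoint, one workaround in line with the non-Form parsing shown above is to bind the body with plain play-json Reads instead of a Form. A sketch, where UpdateContent is a hypothetical model for the PUT body:
import play.api.libs.json._

// Hypothetical model for the PUT body, bound with play-json directly so the
// empty array survives.
case class UpdateContent(content: Seq[String])

implicit val updateContentReads: Reads[UpdateContent] = Json.reads[UpdateContent]

Json.parse("""{"content": []}""").validate[UpdateContent] match {
  case JsSuccess(update, _) => println(update.content) // List()
  case JsError(errors)      => println(s"bad request: $errors")
}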

Spark Scala - get number of unique values by keys

This is a question from a beginner. I have a text file containing computer login information. Once I filter out bad records and map to only the 2 elements I need, I get an RDD that looks like:
(user10,Server1)
(user40,Server2)
(user20,Server2)
(user25,Server2)
(user30,Server2)
(user30,Server2)
(user71,Server1)
(user10,Server1)
I need to find, for each server, the count of unique users. I would like to get something like:
(server1,2)
(server2,4)
I need to stay at the RDD level (no DataFrames yet), and I don't know how to proceed. Any help is appreciated.
Here is a solution that should be easy to understand.
import org.apache.spark.rdd.RDD

def logic(data: RDD[(String, String)]): RDD[(String, Int)] = {
  data
    .map { case (user, server) => (server, Set(user)) }
    .reduceByKey(_ ++ _)
    .map { case (server, userSet) => (server, userSet.size) }
}
The Set data structure is used here as the tool to find the unique users.
If you have already reduced the input text file to the following RDD
(user10,Server1)
(user40,Server2)
(user20,Server2)
(user25,Server2)
(user30,Server2)
(user30,Server2)
(user71,Server1)
(user10,Server1)
The final RDD that you require would be similar to the word-count examples that are abundant on the web, but it needs a little trick. You can do the following:
val finalRdd = rdd
  .groupBy(x => (x._1, x._2))   // deduplicate identical (user, server) pairs
  .map { case (k, v) => k }
  .map(x => (x._2, 1))
  .reduceByKey(_ + _)
finalRdd would be
(Server2,4)
(Server1,2)
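A quick way to sanity-check either answer on the question's sample data (a sketch: it assumes a local SparkContext and that logic from the first answer is in scope):
import org.apache.spark.{SparkConf, SparkContext}

// Local sanity check on the sample data from the question.
val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("unique-users"))

val rdd = sc.parallelize(Seq(
  ("user10", "Server1"), ("user40", "Server2"), ("user20", "Server2"),
  ("user25", "Server2"), ("user30", "Server2"), ("user30", "Server2"),
  ("user71", "Server1"), ("user10", "Server1")
))

logic(rdd).collect.foreach(println) // (Server1,2) and (Server2,4), in some order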

Scala - Traversing a ByteString Until Empty

Is there a more concise and/or performant way to traverse the message than what I have here?
import scala.annotation.tailrec
import akka.util.ByteString

@throws[GarbledMessageException]
def nextValue(message: ByteString) =
  message.indexOf(delimiter) match {
    case i if i >= 0 => message.splitAt(i)
    case _           => throw new GarbledMessageException("Delimiter Not Found")
  }

@tailrec
def processFields(message: ByteString): Unit = nextValue(message) match {
  case (_, ByteString.empty) => // Complete Parsing
  case (value, rest) =>
    // Do work with value
    // loop
    processFields(rest)
}
A new ByteString is created for each split which hurts performance, but at least the underlying Buffer is not copied, only reference counted.
Maybe it can be even better than that?
It may depend on specifically what kind of work you are doing, but if you are looking for something more performant than splitting off ByteStrings, take a look at ByteIterator, which you can get by calling iterator on a ByteString.
A ByteIterator would allow you to go directly to primitive values (ints, floats, etc.) without having to split off new ByteStrings first.
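A minimal sketch of the ByteIterator approach; the field layout (an Int followed by a Double) is made up purely for illustration:
import java.nio.ByteOrder
import akka.util.ByteString

implicit val byteOrder: ByteOrder = ByteOrder.BIG_ENDIAN

// Read primitives straight out of the ByteString without splitting off
// intermediate ByteStrings; the iterator advances as values are consumed.
def readHeader(message: ByteString): (Int, Double) = {
  val it = message.iterator
  val id    = it.getInt    // consumes 4 bytes
  val value = it.getDouble // consumes 8 bytes
  (id, value)
}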

Summary Statistics for string types in spark

Is there something like the summary function in Spark, like the one in R?
The summary calculation that comes with Spark (MultivariateStatisticalSummary) operates only on numeric types.
I am interested in getting the results for string types as well, for example the four most frequently occurring strings (a groupBy kind of operation), the number of uniques, etc.
Is there any preexisting code for this?
If not, please suggest the best way to deal with string types.
I don't think there is such a thing for String in MLlib. But it would probably be a valuable contribution, if you are going to implement it.
Calculating just one of these metrics is easy. E.g. for top 4 by frequency:
def top4(rdd: org.apache.spark.rdd.RDD[String]) =
  rdd
    .map(s => (s, 1))
    .reduceByKey(_ + _)
    .map { case (s, count) => (count, s) }
    .top(4)
    .map { case (count, s) => s }
Or number of uniques:
def numUnique(rdd: org.apache.spark.rdd.RDD[String]) =
  rdd.distinct.count
But doing this for all metrics in a single pass takes more work.
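One way to share the work is to build the frequency table once (a single pass over the raw data), cache it, and derive both metrics from it. A sketch (stringSummary is a made-up name):
import org.apache.spark.rdd.RDD

// Build the (value, count) table once, then derive the top-4 values and the
// number of uniques from the cached frequency RDD.
def stringSummary(rdd: RDD[String]): (Seq[String], Long) = {
  val freqs = rdd.map(s => (s, 1L)).reduceByKey(_ + _).cache()
  val top4    = freqs.map { case (s, count) => (count, s) }.top(4).map(_._2).toSeq
  val uniques = freqs.count()
  (top4, uniques)
}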
These examples assume that, if you have multiple "columns" of data, you have split each column into a separate RDD. This is a good way to organize the data, and it's necessary for operations that perform a shuffle.
What I mean by splitting up the columns:
import org.apache.spark.rdd.RDD

def split(together: RDD[(Long, Seq[String])],
          columns: Int): Seq[RDD[(Long, String)]] = {
  together.cache // We will do N passes over this RDD.
  (0 until columns).map { i =>
    together.mapValues(s => s(i))
  }
}
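A hypothetical usage of split and top4 together, e.g. in spark-shell where a SparkContext named sc is already available (the sample rows are made up):
// Two string "columns" per row: (rowId, Seq(city, browser)).
val rows = sc.parallelize(Seq(
  (0L, Seq("london", "chrome")),
  (1L, Seq("paris",  "chrome")),
  (2L, Seq("london", "firefox"))
))

val perColumn = split(rows, columns = 2)

perColumn.zipWithIndex.foreach { case (col, i) =>
  println(s"column $i top values: ${top4(col.values).mkString(", ")}")
}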