Scala ElasticSearch indexing in relation to dynamically changing schema - scala

I found the code below on CERN's website.
FYI: I am using Spark 1.3.
The example code is fantastic when you know the schema of the dataset you want to index into Elasticsearch.
However, could somebody point me in the right direction so that I can create a method that does the following:
Takes as arguments the schema structure from an external source (column name / data type) (the hard bit), along with the name of the file to be indexed (the easy bit).
Performs the schema mappings inside the function dynamically.
A method like this would allow me to generate a mapped and indexed dataset in ES.
Example Code:
//import elasticsearch packages
import org.elasticsearch.spark._
//define the schema
case class MemT(dt: String, server: String, memoryused: Integer)
//load the csv file into rdd
val Memcsv = sc.textFile("/tmp/flume_memusage.csv")
//split the fields, trim and map it to the schema
val MemTrdd = Memcsv.map(line=>line.split(",")).map(line=>MemT(line(0).trim.toString,line(1).trim.toString,line(2).trim.toInt))
//write the rdd to the elasticsearch
MemTrdd.saveToEs("fmem/logs")
Thank you!
source:
https://db-blog.web.cern.ch/blog/prasanth-kothuri/2016-05-integrating-hadoop-and-elasticsearch-%E2%80%93-part-2-%E2%80%93-writing-and-querying

What I wanted to achieve was to index directly into ES from a DataFrame.
I required that the index mappings be driven by an external schema source. Here is how I achieved that.
BTW: I have omitted some additional validation/processing, but this skeleton code should get anyone with similar needs going.
I have included the following ES dependency in my build.sbt file:
"org.elasticsearch" % "elasticsearch-spark_2.10" % "2.3.3"
Comments welcome.
//Just showing the ES stuff
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.elasticsearch.spark._
import org.elasticsearch.spark.sql.EsSparkSQL
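//Declare a schema
val schemaString = "age:int, name:string,location:string"
//Fill RDD with dummy data
val rdd = sc.textFile("/path/to/your/file.csv")
val seperator = "," //The field separator used in the CSV
//Convert the schema string above into a StructType.
//getFieldTypeInSchema is a helper (not shown) that maps a type name such as "int" to a Spark DataType.
//Each entry is trimmed so that " name:string" does not produce a field called " name".
val schema =
  StructType(
    schemaString.split(",").map(_.trim).map(fieldName =>
      StructField(fieldName.split(":")(0),
        getFieldTypeInSchema(fieldName.split(":")(1)), true)))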
//Convert one raw token to the type declared for its column
def convert(value: String, dataType: DataType): Any = dataType match {
  case IntegerType => value.trim.toInt
  case DoubleType => value.trim.toDouble
  case LongType => value.trim.toLong
  case FloatType => value.trim.toFloat
  case ByteType => value.trim.toByte
  case StringType => value
  case TimestampType => value //kept as a string, as in the original skeleton
  case other => throw new IllegalArgumentException("Unsupported type: " + other)
}
//Map each line of the RDD to a Row whose values follow the schema
val rowRDDx = rdd.map(p => {
  val tokens = p.split(seperator)
  val values = tokens.zipWithIndex.map { case (value, index) =>
    convert(value, schema.fields(index).dataType)
  }
  Row.fromSeq(values)
})
//Convert the RDD with elements to a DF , also specify the intended schema
val df = sqlContext.createDataFrame(rowRDDx, schema)
//index the DF to ES
EsSparkSQL.saveToEs(df,"test/doc")
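The schema-driven conversion above can also be tried without a Spark cluster. The following is a minimal sketch in plain Scala, assuming the same "name:type" schema string format; the names SchemaDrivenParsing, Column, parseSchema, convert and parseLine are hypothetical and are not part of Spark or elasticsearch-hadoop. It produces one Map per line, a shape that elasticsearch-spark can, as far as I know, also index when saveToEs is called on an RDD of Maps.

```scala
object SchemaDrivenParsing {
  // One column of the externally supplied schema: a name plus a type label.
  final case class Column(name: String, typeName: String)

  // Parse "age:int, name:string" into columns, trimming stray whitespace.
  def parseSchema(schemaString: String): Vector[Column] =
    schemaString.split(",").toVector.map { entry =>
      entry.split(":").map(_.trim) match {
        case Array(name, typeName) => Column(name, typeName.toLowerCase)
        case _ =>
          throw new IllegalArgumentException(s"Malformed schema entry: '$entry'")
      }
    }

  // Convert one raw token according to its declared type label.
  // The set of labels is an assumption; extend it as needed.
  def convert(raw: String, typeName: String): Any = typeName match {
    case "int"    => raw.trim.toInt
    case "long"   => raw.trim.toLong
    case "double" => raw.trim.toDouble
    case "string" => raw.trim
    case other    => throw new IllegalArgumentException(s"Unsupported type: $other")
  }

  // Turn one delimited line into a column-name -> typed-value map.
  def parseLine(line: String, schema: Vector[Column], separator: String = ","): Map[String, Any] = {
    val tokens = line.split(separator, -1)
    require(tokens.length == schema.length,
      s"Expected ${schema.length} fields but found ${tokens.length}")
    schema.zip(tokens).map { case (col, raw) => col.name -> convert(raw, col.typeName) }.toMap
  }
}
```

With an RDD of lines, the same functions could be applied as rdd.map(line => SchemaDrivenParsing.parseLine(line, schema)), provided the schema is built on the driver first.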

Related

Dynamic conversion of Array of double columns into multiple columns in nested spark dataframe

My current DataFrame looks like this:
{"id":"1","inputs":{"values":{"0.2":[1,1],"0.4":[1,1],"0.6":[1,1]}},"id1":[1,2]}
I want to transform it into the DataFrame below:
{"id":"1", "v20":[1,1],"v40":[1,1],"v60":[1,1],"id1":[1,2]}
This means that each key under 'values' (0.2, 0.4 and 0.6) is multiplied by 100, prefixed with the letter 'v', and extracted into a separate column.
What would the code look like to achieve this? I have tried withColumn but couldn't get it to work.
Try the code below; the inline comments explain each step.
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StructType
object DynamicCol {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    val df = spark.read.json("src/main/resources/dyamicCol.json") // Load the JSON file
    val dfTemp = df.select(col("inputs.values").as("values")) // Temp DataFrame for fetching the nested values
    val index = dfTemp.schema.fieldIndex("values")
    val propSchema = dfTemp.schema(index).dataType.asInstanceOf[StructType]
    val dfFinal = propSchema.fields.foldLeft(df)((df, field) => { // Add one column per nested field
      val colNameInt = (field.name.toDouble * 100).toInt
      val colName = s"v$colNameInt"
      df.withColumn(colName, col("inputs.values.`" + field.name + "`")) // Add the nested column mapping
    }).drop("inputs") // Drop the extra column
    dfFinal.write.mode(SaveMode.Overwrite).json("src/main/resources/dyamicColOut.json") // Output the JSON file
  }
}
I would split the column-renaming logic into two parts: names that are numeric values, and names that don't change.
def stringDecimalToVNumber(colName: String): String =
  "v" + (colName.toFloat * 100).toInt.toString
Then form a single function that transforms the name according to the case:
val floatRegex = """(\d+\.?\d*)""".r
def transformColumnName(colName: String): String = colName match {
  case floatRegex(v) => stringDecimalToVNumber(v) // it's a float, transform it
  case x => x // keep it
}
Now that we have the function to transform the column names, let's pick up the schema dynamically.
val flattenDF = df.select("id", "inputs.values.*")
val finalDF = flattenDF
  .schema.names
  .foldLeft(flattenDF)((dfacum, x) => {
    val newName = transformColumnName(x)
    if (newName == x)
      dfacum // the name didn't need to be changed
    else
      dfacum.withColumnRenamed(x, newName)
  })
This dynamically renames all the columns inside inputs.values and puts them next to id.
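One caveat with both renaming variants: multiplying a floating-point value by 100 and truncating with toInt can be off by one for some inputs. With Double, for example, (0.29 * 100).toInt evaluates to 28. The sketch below is plain Scala that does the same renaming with BigDecimal, which keeps the decimal digits exact. The object name ColumnNames and its methods are hypothetical, and the regex is the one from the answer above.

```scala
object ColumnNames {
  // Same pattern as floatRegex above: digits with an optional fraction.
  private val Decimal = """(\d+\.?\d*)""".r

  // "0.2" -> "v20". BigDecimal parses the decimal text exactly, so there is
  // no binary rounding error before the truncation to Int.
  def toVName(decimal: String): String =
    "v" + (BigDecimal(decimal) * 100).toInt.toString

  // Rename numeric column names and leave every other name untouched.
  def rename(name: String): String = name match {
    case Decimal(v) => toVName(v)
    case other      => other
  }
}
```

ColumnNames.rename could be dropped in wherever transformColumnName is used in the fold above.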

Spark: dynamic schema definition out of csv file

I receive schema information as a CSV file like the one below. Assume I have 100+ columns.
FIRSTNAME|VARCHAR2
LASTANME|VARCHAR2
MIDDLENAME|VARCHAR2
BIRTHDATE|DATE
ADULTS|NUMBER
ADDRESS|VARCHAR2
How can I generate a schema dynamically in Spark in this scenario?
You can use string splitting and pattern matching assuming that the schema file is a validly formatted csv. Assuming that you already have the schema loaded as a single comma-separated string, the following will work:
import org.apache.spark.sql.types._
def toSchema(str: String) = {
  val structFields = str.split(",").map { s =>
    val split = s.split("\\|")
    val name: String = split.head
    val typeStr = split.tail.head
    val varCharPattern = "varchar[0-9]+".r
    val datePattern = "date".r
    val numPattern = "number".r
    val t = typeStr.toLowerCase match {
      case varCharPattern() => StringType
      case datePattern() => TimestampType
      case numPattern() => IntegerType
      case _ => throw new Exception("unknown type string")
    }
    StructField(name, t)
  }
  StructType(structFields)
}
You can add more types easily enough by just adding new cases to the pattern matching statement.
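If the schema file is read line by line rather than as one comma-separated string, the same mapping can be expressed without any Spark types. The following is a minimal sketch in plain Scala; the object name SchemaFile and its methods are hypothetical, and the type mapping (VARCHAR* to STRING, DATE to TIMESTAMP, NUMBER to INT) simply mirrors the answer above. It emits a DDL-style string, which recent Spark versions should be able to turn into a schema (for example via StructType.fromDDL); treat that last step as an assumption to check against your Spark version.

```scala
object SchemaFile {
  // Map a source-system type label to a Spark SQL type name (assumed mapping).
  def sparkTypeName(sourceType: String): String =
    sourceType.trim.toUpperCase match {
      case t if t.startsWith("VARCHAR") => "STRING"
      case "DATE"                       => "TIMESTAMP"
      case "NUMBER"                     => "INT"
      case other =>
        throw new IllegalArgumentException(s"Unknown type string: $other")
    }

  // Parse lines such as "FIRSTNAME|VARCHAR2" into (column name, type name) pairs.
  def parse(lines: Seq[String]): Seq[(String, String)] =
    lines.map(_.trim).filter(_.nonEmpty).map { line =>
      line.split("\\|") match {
        case Array(name, tpe) => (name.trim, sparkTypeName(tpe))
        case _ =>
          throw new IllegalArgumentException(s"Malformed schema line: '$line'")
      }
    }

  // Render the pairs as "NAME TYPE, NAME TYPE, ...".
  def toDdl(lines: Seq[String]): String =
    parse(lines).map { case (name, tpe) => s"$name $tpe" }.mkString(", ")
}
```

Failing fast on unknown or malformed entries is a deliberate choice here: with 100+ columns, a silently skipped line would shift every later column.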

Add scoped variable per row iteration in Apache Spark

I'm reading multiple html files into a dataframe in Spark.
I'm converting elements of the html to columns in the dataframe using a custom udf
val dataset = spark
  .sparkContext
  .wholeTextFiles(inputPath)
  .toDF("filepath", "filecontent")
  .withColumn("biz_name", parseDocValue(".biz-page-title")('filecontent))
  .withColumn("biz_website", parseDocValue(".biz-website a")('filecontent))
  ...
def parseDocValue(cssSelectorQuery: String) =
  udf((html: String) => Jsoup.parse(html).select(cssSelectorQuery).text())
Which works perfectly, however each withColumn call will result in the parsing of the html string, which is redundant.
Is there a way (without using lookup tables or such) that I can generate 1 parsed Document (Jsoup.parse(html)) based on the "filecontent" column per row and make that available for all withColumn calls in the dataframe?
Or shouldn't I even try using DataFrames and just use RDD's ?
So the final answer was in fact quite simple:
Just map over the rows and create the object once there.
def docValue(cssSelectorQuery: String, attr: Option[String] = None)(implicit document: Document): Option[String] = {
  val domObject = document.select(cssSelectorQuery)
  val domValue = attr match {
    case Some(a) => domObject.attr(a)
    case None => domObject.text()
  }
  domValue match {
    case x if x == null || x.isEmpty => None
    case y => Some(y)
  }
}
val dataset = spark
  .sparkContext
  .wholeTextFiles(inputPath, minPartitions = 265)
  .map {
    case (filepath, filecontent) => {
      // Parse the document once per row; docValue picks it up implicitly
      implicit val document = Jsoup.parse(filecontent)
      val customDataJson = docJson(filecontent, customJsonRegex)
      DataEntry(
        biz_name = docValue(".biz-page-title"),
        biz_website = docValue(".biz-website a"),
        url = docValue("meta[property=og:url]", attr = Some("content")),
        ...
        filename = Some(fileName(filepath)),
        fileTimestamp = Some(fileTimestamp(filepath))
      )
    }
  }
  .toDS()
I'd probably rewrite it as follows, to do the parsing and selecting in one go and put the results in a temporary column:
val dataset = spark
  .sparkContext
  .wholeTextFiles(inputPath)
  .toDF("filepath", "filecontent")
  .withColumn("temp", parseDocValue(Array(".biz-page-title", ".biz-website a"))('filecontent))
  .withColumn("biz_name", col("temp")(0))
  .withColumn("biz_website", col("temp")(1))
  .drop("temp")
def parseDocValue(cssSelectorQueries: Array[String]) =
  udf((html: String) => {
    val j = Jsoup.parse(html) // parse once, then run every selector against the same document
    cssSelectorQueries.map(query => j.select(query).text())
  })

How do I save a file in a Spark PairRDD using the key as the filename and the value as the contents?

In Spark, I have downloaded multiple files from s3 using sc.binaryFiles. The RDD that results has the key as the filename and the value has the contents of the file. I have decompressed the file contents, csv parsed it, and converted it to a dataframe. So, now I have a PairRDD[String, DataFrame]. The problem I have is that I want to save the file to HDFS using the key as the filename and save the value as a parquet file overwriting one if it already exists. This is what I got so far.
val files = sc.binaryFiles(lFiles.mkString(","), 250).mapValues(stream => sc.parallelize(readZipStream(new ZipInputStream(stream.open))))
val tables = files.mapValues(file => {
  val header = file.first.split(",")
  val schema = StructType(header.map(fieldName => StructField(fieldName, StringType, true)))
  val lines = file.mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else iter }.flatMap(x => x.split("\n"))
  val rowRDD = lines.map(x => Row.fromSeq(x.split(",")))
  sqlContext.createDataFrame(rowRDD, schema)
})
If you have any advice, please let me know. I would appreciate it.
Thanks,
Ben
Saving files to HDFS from Spark works the same way as in Hadoop. You need to create a class that extends MultipleTextOutputFormat; in that custom class you can define the output file name yourself. An example is below:
import java.text.SimpleDateFormat
import java.util.Date
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String = {
    "realtime-" + new SimpleDateFormat("yyyyMMddHHmm").format(new Date()) + "00-" + name
  }
}
The calling code is below:
RDD.rddToPairRDDFunctions(rdd.map { case (key, list) =>
  (NullWritable.get, key)
}).saveAsHadoopFile(input, classOf[NullWritable], classOf[String], classOf[RDDMultipleTextOutputFormat])

Convert csv to RDD

I tried the accepted solution in "How do I convert csv file to rdd". I want to print out all the users except "om":
val csv = sc.textFile("file.csv") // original file
val data = csv.map(line => line.split(",").map(elem => elem.trim)) // lines in rows
val header = new SimpleCSVHeader(data.take(1)(0)) // we build our header with the first line
val rows = data.filter(line => header(line,"user") != "om") // filter the header out
val users = rows.map(row => header(row,"user"))
users.collect().foreach(user => println(user))
but I got an error:
java.util.NoSuchElementException: key not found: user
I tried to debug it and found that the index attribute in header looks like this:
Since I'm new to Spark and Scala, does this mean that user is already in a Map? Then why the key not found error?
I found my mistake. It's not related to Spark/Scala. When I created the example CSV, I used this command in R:
df <- data.frame(user=c('om','daniel','3754978'),topic=c('scala','spark','spark'),hits=c(120,80,1))
write.csv(df, "df.csv",row.names=FALSE)
but write.csv puts double quotes (") around factors by default. That's why the map can't find the key user: the real key is "user", quotes included. Using
write.csv(df, "df.csv",quote=FALSE, row.names=FALSE)
solves this problem.
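If regenerating the file is not an option, the quotes can instead be stripped while parsing. The following is a minimal sketch in plain Scala; the object name QuotedCsv and its methods are hypothetical and not part of the SimpleCSVHeader code from the linked answer. It deliberately uses a naive split, so it does not handle commas inside quoted fields.

```scala
object QuotedCsv {
  // Trim a field and drop one pair of surrounding double quotes, if present.
  def unquote(field: String): String = {
    val t = field.trim
    if (t.length >= 2 && t.startsWith("\"") && t.endsWith("\""))
      t.substring(1, t.length - 1)
    else t
  }

  // Naive split on commas; quoted fields containing commas are not supported.
  def fields(line: String): Vector[String] =
    line.split(",", -1).toVector.map(unquote)

  // Build a header-name -> column-index lookup from the header line.
  def headerIndex(headerLine: String): Map[String, Int] =
    fields(headerLine).zipWithIndex.toMap
}
```

In the Spark code above, the per-line split could then become csv.map(line => QuotedCsv.fields(line)), so the header key is user rather than "user".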
I've rewritten the sample code to remove the header method.
IMO, this example provides a step by step walkthrough that is easier to follow. Here is a more detailed explanation.
def main(args: Array[String]): Unit = {
  val csv = sc.textFile("/path/to/your/file.csv")
  // split / clean data
  val headerAndRows = csv.map(line => line.split(",").map(_.trim))
  // get header
  val header = headerAndRows.first
  // filter out header
  val data = headerAndRows.filter(_(0) != header(0))
  // splits to map (header/value pairs)
  val maps = data.map(splits => header.zip(splits).toMap)
  // filter out the 'om' user
  val result = maps.filter(map => map("user") != "om")
  // print result
  result.foreach(println)
}