Processing Array[Byte] in spark dataframes - scala

I have a dataframe df1 as below with schema:
scala> df1.printSchema
root
|-- filecontent: binary (nullable = true)
|-- filename: string (nullable = true)
The DF has filename and its content. The content is GZIPped. I could use something like the below to unzip the data in filecontent and save it to HDFS.
def decompressor(origRow: Row) = {
  val filename = origRow.getString(1)
  val filecontent = origRow.getAs[Array[Byte]](0) // the column is binary, i.e. Array[Byte]
  val unzippedData = new GZIPInputStream(new ByteArrayInputStream(filecontent))
  val hadoop_fs = FileSystem.get(sc.hadoopConfiguration)
  val filenamePath = new Path(filename)
  val fos = hadoop_fs.create(filenamePath)
  org.apache.hadoop.io.IOUtils.copyBytes(unzippedData, fos, sc.hadoopConfiguration)
  fos.close()
}
My objective:
Since the filecontent column data in df1 is binary, i.e. Array[Byte], I shouldn't distribute the data; I want to keep it together and pass it to the function so that it can decompress it and save it to a file.
My Question:
How do I not distribute the data (column data)?
How do I make sure the processing happens for 1 row at a time?
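For reference, a minimal sketch of one way to run such logic one row at a time with foreach, keeping each file's bytes together inside a single call. The assumptions are not from the original post: the schema above holds, and a default Hadoop Configuration built on the executors points at the right HDFS, since sc is not available inside the closure.
import java.io.ByteArrayInputStream
import java.util.zip.GZIPInputStream
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

df1.foreach { row =>
  val filename    = row.getAs[String]("filename")
  val filecontent = row.getAs[Array[Byte]]("filecontent") // the whole byte array for one file
  val unzipped    = new GZIPInputStream(new ByteArrayInputStream(filecontent))
  val fs          = FileSystem.get(new Configuration())   // sc.hadoopConfiguration is not usable on executors
  val out         = fs.create(new Path(filename))
  org.apache.hadoop.io.IOUtils.copyBytes(unzipped, out, 4096, true) // true closes both streams
}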

Related

Scala explode followed by UDF on a dataframe fails

I have a scala dataframe with the following schema:
root
|-- time: string (nullable = true)
|-- itemId: string (nullable = true)
|-- itemFeatures: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
I want to explode the itemFeatures column and then send my dataframe to a UDF. But as soon as I include the explode, calling the UDF results in this error:
org.apache.spark.SparkException: Task not serializable
I can't figure out why???
Environment: Scala 2.11.12, Spark 2.4.4
Full example:
val dataList = List(
("time1", "id1", "map1"),
("time2", "id2", "map2"))
val df = dataList.toDF("time", "itemId", "itemFeatures")
val dfExploded = df.select(col("time"), col("itemId"), explode("itemFeatures"))
val doNextThingUDF: UserDefinedFunction = udf(doNextThing _)
val dfNextThing = dfExploded.withColumn("nextThing", doNextThingUDF(col("time"))
where my UDF looks like this:
val doNextThing(time: String): String = {
time+"blah"
}
If I remove the explode, everything works fine, and if I don't call the UDF after the explode, everything works fine. I could imagine Spark is somehow unable to send each row to a UDF if it is dynamically executing the explode and doesn't know how many rows are going to exist, but even when I add e.g. dfExploded.cache() and dfExploded.count() I still get the error. Is this a known issue? What am I missing?
I think the issue comes from how you define your doNextThing function. There are also a couple of typos in your "full example".
In particular, the itemFeatures column is a String in your example; I understand it should be a Map.
But here is a working example:
val dataList = List(
("time1", "id1", Map("map1" -> 1)),
("time2", "id2", Map("map2" -> 2)))
val df = dataList.toDF("time", "itemId", "itemFeatures")
val dfExploded = df.select(col("time"), col("itemId"), explode($"itemFeatures"))
val doNextThing = (time: String) => {time+"blah"}
val doNextThingUDF = udf(doNextThing)
val dfNextThing = dfExploded.withColumn("nextThing", doNextThingUDF(col("time")))
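For what it's worth, a hedged illustration of the usual cause of that error, assuming the same imports as the example above; the class name below is made up, not from the question. If doNextThing is a def on a non-serializable enclosing class, udf(doNextThing _) captures that instance in the closure, whereas a standalone function value does not.
// Illustrative only: this shape commonly triggers "Task not serializable",
// because eta-expansion of the method closes over `this` (the MyJob instance)
class MyJob(spark: SparkSession) {   // MyJob itself is not Serializable
  def doNextThing(time: String): String = time + "blah"
  val doNextThingUDF = udf(doNextThing _)
}
// The function value used above (val doNextThing = ...) carries no such reference.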

Reading ambiguous column name in Spark sql Dataframe using scala

I have duplicate columns in a text file. When I try to load that text file using Spark Scala code, it gets loaded successfully into a data frame, and I can see the first 20 rows with df.show().
Full Code:-
val sc = new SparkContext(conf)
val hivesql = new org.apache.spark.sql.hive.HiveContext(sc)
val rdd = sc.textFile("/...FilePath.../*")
val fieldCount = rdd.map(_.split("[|]")).map(x => x.size).first()
val field = rdd.zipWithIndex.filter(_._2==0).map(_._1).first()
val fields = field.split("[|]").map(fieldName =>StructField(fieldName, StringType, nullable=true))
val schema = StructType(fields)
val rowRDD = rdd.map(_.split("[|]")).map(attributes => getARow(attributes,fieldCount))
val df = hivesql.createDataFrame(rowRDD, schema)
df.registerTempTable("Sample_File")
df.show()
Up to this point my code works fine. But as soon as I try the code below, it gives me an error:
val results = hivesql.sql("Select id,sequence,sequence from Sample_File")
So I have 2 columns with the same name in the text file, i.e. sequence.
How can I access those two columns? I tried sequence#2 but it still doesn't work.
Spark Version:-1.6.0
Scala Version:- 2.10.5
Result of df.printSchema():
|-- id: string (nullable = true)
|-- sequence: string (nullable = true)
|-- sequence: string (nullable = true)
I second #smart_coder's approach, though I have a slightly different one. Please find it below.
You need unique column names to be able to query via hivesql.sql.
You can rename the column names dynamically using the code below:
Your code:
val df = hivesql.createDataFrame(rowRDD, schema)
After this point, we need to remove the ambiguity; below is the solution:
// walk over the column names and, whenever a name still collides with another one,
// rename it by appending its index
var list = df.schema.map(_.name).toList
for (i <- 0 to list.size - 1) {
  val cont = list.count(_ == list(i))
  val col = list(i)
  if (cont != 1) {
    list = list.take(i) ++ List(col + i) ++ list.drop(i + 1)
  }
}
val df1 = df.toDF(list: _*)
// you would get the output as below:
Result of df1.printSchema():
|-- id: string (nullable = true)
|-- sequence1: string (nullable = true)
|-- sequence: string (nullable = true)
So basically, we are getting all the column names as a list and then checking whether any column repeats more than once; if a column repeats, we append the index to its name, and then we create a new dataframe df1 from the list of renamed column names.
I have tested this in Spark 2.4, but it should work in 1.6 as well.
The below code might help you to resolve your problem. I have tested this in Spark 1.6.3.
val sc = new SparkContext(conf)
val hivesql = new org.apache.spark.sql.hive.HiveContext(sc)
val rdd = sc.textFile("/...FilePath.../*")
val fieldCount = rdd.map(_.split("[|]")).map(x => x.size).first()
val field = rdd.zipWithIndex.filter(_._2==0).map(_._1).first()
val fields = field.split("[|]").map(fieldName =>StructField(fieldName, StringType, nullable=true))
val schema = StructType(fields)
val rowRDD = rdd.map(_.split("[|]")).map(attributes => getARow(attributes,fieldCount))
val df = hivesql.createDataFrame(rowRDD, schema)
val colNames = Seq("id","sequence1","sequence2")
val df1 = df.toDF(colNames: _*)
df1.registerTempTable("Sample_File")
val results = hivesql.sql("select id,sequence1,sequence2 from Sample_File")

Convert a dataframe into json string in Spark

I'm a bit new to Spark and Scala. I have a (large, ~1 million) Scala Spark DataFrame, and I need to turn it into a JSON string.
The schema of the df is like this:
root
 |-- key: string (nullable = true)
 |-- value: string (nullable = true)
 |      |-- valKey (String)
 |      |-- vslScore (Double)
key is a product id, and value is a product set and its score values that I get from a parquet file.
I only manage to get something like the following; for the curly brackets I simply concatenate them onto the result.
3434343<tab>{smartphones:apple:0.4564879,smartphones:samsung:0.723643 }
But I expect a value like this, where each key is wrapped in double quotes:
3434343<tab>{"smartphones:apple":0.4564879, "smartphones:samsung":0.723643 }
Is there any way to directly convert this into a JSON string without concatenating anything? I hope to write the output files in .csv format. This is the code I'm using:
val df = parquetReaderDF.withColumn("key", col("productId"))
  .withColumn("value", struct(
    col("productType"),
    col("brand"),
    col("score")))
  .select("key", "value")

val df2 = df.withColumn("valKey", concat(
    col("productType"), lit(":"),
    col("brand"), lit(":"),
    col("score")))
  .groupBy("key")
  .agg(collect_list(col("valKey")))
  .map { r =>
    val key = r.getAs[String]("key")
    val value = r.getAs[Seq[String]]("collect_list(valKey)").mkString(",")
    (key, value)
  }
  .toDF("key", "valKey")
  .withColumn("valKey", concat(lit("{"), col("valKey"), lit("}")))

df2.coalesce(1)
  .write.mode(SaveMode.Overwrite)
  .format("com.databricks.spark.csv")
  .option("delimiter", "\t")
  .option("header", "false")
  .option("quoteMode", "yes")
  .save("data.csv")

Best approach for parsing a large structured file with Apache Spark

I have a huge text file (in GBs) with plain text data in each line, which needs to be parsed and extracted into a structure for further processing. Each line has text of 200 characters in length, and I have a regular expression to parse each line and split it into different groups, which will later be saved as flat column data.
data sample
1759387ACD06JAN1910MAR191234567ACRT
RegExp
(.{7})(.{3})(.{7})(.{7})(.{7})(.{4})
Data Structure
Customer ID, Code, From Date, To Date, TransactionId, Product code
1759387, ACD, 06JAN19, 10MAR19, 1234567, ACRT
Please suggest the best approach to parse this huge data and push it to an in-memory grid, which will be used again by Spark jobs for further processing when the respective APIs are invoked.
You can use the DataFrame approach. Copy the serial file to HDFS using the -copyFromLocal command
and use the code below to parse each record.
I'm assuming the sample records in gireesh.txt are as below:
1759387ACD06JAN1910MAR191234567ACRT
2759387ACD08JAN1910MAY191234567ACRY
3759387ACD03JAN1910FEB191234567ACRZ
The spark code
import org.apache.log4j.{Level, Logger}
import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.sql.Encoders._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object Gireesh {
  def main(args: Array[String]) {
    Logger.getLogger("org").setLevel(Level.ERROR)
    val spark = SparkSession.builder().appName("Operations..").master("local[*]").getOrCreate()
    import spark.implicits._
    val pat = """(.{7})(.{3})(.{7})(.{7})(.{7})(.{4})""".r
    val headers = List("custid", "code", "fdate", "tdate", "tranid", "prdcode")
    val rdd = spark.sparkContext.textFile("in/gireesh.txt")
      .map( x => {
        val y = scala.collection.mutable.ArrayBuffer[String]()
        pat.findAllIn(x).matchData.foreach( m => y.appendAll(m.subgroups))
        (y(0).toLong, y(1), y(2), y(3), y(4).toLong, y(5))
      })
    val df = rdd.toDF(headers: _*)
    df.printSchema()
    df.show(false)
  }
}
gives the below results.
root
|-- custid: long (nullable = false)
|-- code: string (nullable = true)
|-- fdate: string (nullable = true)
|-- tdate: string (nullable = true)
|-- tranid: long (nullable = false)
|-- prdcode: string (nullable = true)
+-------+----+-------+-------+-------+-------+
|custid |code|fdate |tdate |tranid |prdcode|
+-------+----+-------+-------+-------+-------+
|1759387|ACD |06JAN19|10MAR19|1234567|ACRT |
|2759387|ACD |08JAN19|10MAY19|1234567|ACRY |
|3759387|ACD |03JAN19|10FEB19|1234567|ACRZ |
+-------+----+-------+-------+-------+-------+
EDIT1:
You can have the map "transformation" in a separate function like below.
def parse(record: String) = {
  val y = scala.collection.mutable.ArrayBuffer[String]()
  pat.findAllIn(record).matchData.foreach( m => y.appendAll(m.subgroups))
  (y(0).toLong, y(1), y(2), y(3), y(4).toLong, y(5))
}

val rdd = spark.sparkContext.textFile("in/gireesh.txt")
  .map( x => parse(x) )
val df = rdd.toDF(headers: _*)
df.printSchema()
You need to tell spark which file to read and how to process the content while reading it.
Here is an example:
val numberOfPartitions = 5 // this needs to be optimized based on the size of the file and the available resources (e.g. memory)
val someObjectsRDD: RDD[SomeObject] =
  sparkContext.textFile("/path/to/your/file", numberOfPartitions)
    .mapPartitions(
      { stringsFromFileIterator =>
        stringsFromFileIterator.map(stringFromFile => ??? /* here process the raw string and return the result */)
      },
      preservesPartitioning = true
    )
In the code snippet, SomeObject is an object with the data structure from the question.
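For illustration, a hedged sketch of what the per-line processing could look like for the record layout in the question; SomeObject's fields and the parseLine helper are illustrative names, not part of the original answer.
case class SomeObject(custId: Long, code: String, fromDate: String,
                      toDate: String, transactionId: Long, productCode: String)

val pattern = """(.{7})(.{3})(.{7})(.{7})(.{7})(.{4})""".r

// could be used in place of the ??? placeholder inside mapPartitions above
def parseLine(line: String): SomeObject = line match {
  case pattern(custId, code, from, to, tranId, prd) =>
    SomeObject(custId.toLong, code, from, to, tranId.toLong, prd)
}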

How to extract values from json string?

I have a file which has a bunch of columns, and one column called jsonstring is of string type and has JSON strings in it… let's say the format is the following:
{
"key1": "value1",
"key2": {
"level2key1": "level2value1",
"level2key2": "level2value2"
}
}
I want to parse this column with something like this: jsonstring.key1, jsonstring.key2.level2key1 to return value1, level2value1.
How can I do that in Scala or Spark SQL?
With Spark 2.2 you could use the function from_json which does the JSON parsing for you.
from_json(e: Column, schema: String, options: Map[String, String]): Column parses a column containing a JSON string into a StructType or ArrayType of StructTypes with the specified schema.
Combined with the support for flattening nested columns using * (star), that seems the best solution.
// the input dataset (just a single JSON blob)
val jsonstrings = Seq("""{
"key1": "value1",
"key2": {
"level2key1": "level2value1",
"level2key2": "level2value2"
}
}""").toDF("jsonstring")
// define the schema of JSON messages
import org.apache.spark.sql.types._
val key2schema = new StructType()
.add($"level2key1".string)
.add($"level2key2".string)
val schema = new StructType()
.add($"key1".string)
.add("key2", key2schema)
scala> schema.printTreeString
root
|-- key1: string (nullable = true)
|-- key2: struct (nullable = true)
| |-- level2key1: string (nullable = true)
| |-- level2key2: string (nullable = true)
val messages = jsonstrings
.select(from_json($"jsonstring", schema) as "json")
.select("json.*") // <-- flattening nested fields
scala> messages.show(truncate = false)
+------+---------------------------+
|key1 |key2 |
+------+---------------------------+
|value1|[level2value1,level2value2]|
+------+---------------------------+
scala> messages.select("key1", "key2.*").show(truncate = false)
+------+------------+------------+
|key1 |level2key1 |level2key2 |
+------+------------+------------+
|value1|level2value1|level2value2|
+------+------------+------------+
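Since the question also asks about Spark SQL: a hedged alternative that avoids defining a schema is get_json_object (available since Spark 1.6), which extracts one JSON path at a time:
import org.apache.spark.sql.functions.get_json_object

jsonstrings.select(
  get_json_object($"jsonstring", "$.key1").as("key1"),
  get_json_object($"jsonstring", "$.key2.level2key1").as("level2key1")
).show(false)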
You can use withColumn + udf + json4s:
import org.json4s.{DefaultFormats, MappingException}
import org.json4s.jackson.JsonMethods._
import org.apache.spark.sql.functions._
def getJsonContent(jsonstring: String): (String, String) = {
  implicit val formats = DefaultFormats
  val parsedJson = parse(jsonstring)
  val value1 = (parsedJson \ "key1").extract[String]
  val level2value1 = (parsedJson \ "key2" \ "level2key1").extract[String]
  (value1, level2value1)
}
val getJsonContentUDF = udf((jsonstring: String) => getJsonContent(jsonstring))
df.withColumn("parsedJson", getJsonContentUDF(df("jsonstring")))