Passing functions in Spark - Scala

This is my idea:
import java.io.File

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

object pizD {
  def filePath = {
    new File(this.getClass.getClassLoader.getResource("wikipedia/wikipedia.dat").toURI).getPath
  }

  def regex(line: String): pichA = {
    ......
    ......
    pichA(t1, t2)
  }
}
case class pichA(t1: String, t2: String)

object dushP {
  val conf = new SparkConf()
  val sc = new SparkContext(conf)
  val mirdd: RDD[pichA] = ???
}

How do I integrate sc.textFile with my methods filePath and regex? I want to combine them in order to get a new RDD.

val baseRDD = sc.textFile(pizD.filePath).filter { line =>
  val value = pizD.regex(line)
  value != null
}

Assuming pizD.filePath gives you the file name as a string and regex() returns null when the regex doesn't match, the above code should do the trick.
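If the end goal is the RDD[pichA] (the mirdd above) rather than the raw lines, a minimal sketch, still assuming regex returns null for lines that don't match, is to parse first and then drop the nulls:

val mirdd: RDD[pichA] = sc.textFile(pizD.filePath)
  .map(line => pizD.regex(line)) // parse each line into a pichA (or null)
  .filter(_ != null)             // keep only the lines that matched

A cleaner variant would be to have regex return Option[pichA] and use flatMap, which avoids nulls altogether.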

Related

Migrate code Scala to databricks notebook

I'm working to get this code running using notebooks in Databricks (it's already tested and working with an IDE), but I can't get it working if I change the structure of the code.
import java.io.{BufferedReader, File, InputStreamReader, PrintWriter}
import java.text.SimpleDateFormat

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object TestUnit {
  val dateFormat = new SimpleDateFormat("yyyyMMdd")

  case class Averages(cust: String, Num: String, date: String, credit: Double)

  def main(args: Array[String]): Unit = {
    val inputFile = "s3a://tfsdl-ghd-wb/raidnd/Cleartablet.csv"
    val outputFile = "s3a://tfsdl-ghd-wb/raidnd/Incte_19&20.csv"
    val fileSystem = getFileSystem(inputFile)
    val inputData = readCSVFileLines(fileSystem, inputFile, skipHeader = true)
      .toSeq
    val filtinp = inputData.filter(x => x.nonEmpty)
      .map(x => x.split(","))
      .map(x => Revenue(x(6), x(5), x(0), x(8).toDouble))
    // Create output writer
    val writer = new PrintWriter(new File(outputFile))
    // Header for output CSV file
    writer.write("Date,customer,number,Credit,Average Credit/SKU\n")
    filtinp.foreach { x =>
      val (com1, avg1) = com1Average(filtermp, x)
      val (com2, avg2) = com2Average(filtermp, x)
    }
    // Write row to output csv file
    writer.write(s"${x.day},${x.customer},${x.number},${x.credit},${avgcredit1},${avgcredit2}\n")
    writer.close() // close the writer
  }
}
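One structural difference worth noting when moving code like this into a Databricks notebook: cells run top to bottom with a preconfigured spark and sc, so the object TestUnit / def main wrapper is usually unnecessary and the body can run as plain top-level statements. A minimal sketch under that assumption, reusing the question's own helpers (getFileSystem, readCSVFileLines), whose definitions are omitted here:

// Databricks notebook cell: no object/main wrapper; `spark` and `sc` are already provided.
import java.text.SimpleDateFormat

val dateFormat = new SimpleDateFormat("yyyyMMdd")
case class Averages(cust: String, num: String, date: String, credit: Double)

val inputFile = "s3a://tfsdl-ghd-wb/raidnd/Cleartablet.csv"

// Parse the CSV with the question's helper functions (definitions omitted here).
val fileSystem = getFileSystem(inputFile)
val rows = readCSVFileLines(fileSystem, inputFile, skipHeader = true)
  .toSeq
  .filter(_.nonEmpty)
  .map(_.split(","))
  .map(x => Averages(x(6), x(5), x(0), x(8).toDouble))

Note also that new PrintWriter(new File(outputFile)) writes to the driver's local filesystem, so an s3a:// output path would need to go through the Hadoop FileSystem API (or DBFS) instead.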

Flink sink to Parquet file with AvroParquetWriter is not writing data to file

I am trying to write a Parquet file as a sink using AvroParquetWriter. The file is created but with 0 length (no data is written). Am I doing something wrong? I couldn't figure out what the problem is.
import io.eels.component.parquet.ParquetWriterConfig
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter
import org.apache.parquet.hadoop.{ParquetFileWriter, ParquetWriter}
import org.apache.parquet.hadoop.metadata.CompressionCodecName
import scala.io.Source
import org.apache.flink.streaming.api.scala._
object Tester extends App {
  val env = StreamExecutionEnvironment.getExecutionEnvironment

  def now = System.currentTimeMillis()
  val path = new Path(s"/tmp/test-$now.parquet")

  val schemaString = Source.fromURL(getClass.getResource("/request_schema.avsc")).mkString
  val schema: Schema = new Schema.Parser().parse(schemaString)
  val compressionCodecName = CompressionCodecName.SNAPPY
  val config = ParquetWriterConfig()

  val genericRecord: GenericRecord = new GenericData.Record(schema)
  genericRecord.put("name", "test_b")
  genericRecord.put("code", "NoError")
  genericRecord.put("ts", 100L)

  val stream = env.fromElements(genericRecord)

  val writer: ParquetWriter[GenericRecord] = AvroParquetWriter.builder[GenericRecord](path)
    .withSchema(schema)
    .withCompressionCodec(compressionCodecName)
    .withPageSize(config.pageSize)
    .withRowGroupSize(config.blockSize)
    .withDictionaryEncoding(config.enableDictionary)
    .withWriteMode(ParquetFileWriter.Mode.OVERWRITE)
    .withValidation(config.validating)
    .build()

  writer.write(genericRecord)

  stream.addSink { r =>
    writer.write(r)
  }

  env.execute()
}
The problem is that you don't close the ParquetWriter. This is necessary to flush pending elements to disk. You could solve the problem by defining your own RichSinkFunction where you close the ParquetWriter in the close method:
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.{RichSinkFunction, SinkFunction}

class ParquetWriterSink(val path: String, val schema: String, val compressionCodecName: CompressionCodecName, val config: ParquetWriterConfig) extends RichSinkFunction[GenericRecord] {
  var parquetWriter: ParquetWriter[GenericRecord] = null

  override def open(parameters: Configuration): Unit = {
    parquetWriter = AvroParquetWriter.builder[GenericRecord](new Path(path))
      .withSchema(new Schema.Parser().parse(schema))
      .withCompressionCodec(compressionCodecName)
      .withPageSize(config.pageSize)
      .withRowGroupSize(config.blockSize)
      .withDictionaryEncoding(config.enableDictionary)
      .withWriteMode(ParquetFileWriter.Mode.OVERWRITE)
      .withValidation(config.validating)
      .build()
  }

  override def close(): Unit = {
    parquetWriter.close()
  }

  override def invoke(value: GenericRecord, context: SinkFunction.Context[_]): Unit = {
    parquetWriter.write(value)
  }
}
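With the sink defined, the manual writer and the lambda sink in the original program can be replaced by the RichSinkFunction, for example (a sketch reusing the path, schemaString, compressionCodecName, and config values from the question):

// open() builds the writer per subtask and close() flushes it when the job finishes.
stream.addSink(new ParquetWriterSink(path.toString, schemaString, compressionCodecName, config))
env.execute()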

Error Graph X Caused by: java.lang.NoSuchMethodError: scala.Product.$init$(Lscala/Product;)V

I have been playing with case classes but I keep getting the error above when I try to implement them. This is my code:
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

class EdgeProperties()

case class WriterWriterProperties(weight: String, edgeType: String) extends EdgeProperties

object GraphXAnalysis2 {
  val edgeWeightedWriterWriterCollaborated = "in/Graphs/Graph4_WriterWriter/EdgesWeightedWriterWriter_writerscollaborated.csv"
  val vertexWriterWriter = "in/Graphs/Graph4_WriterWriter/Vertices.csv"

  val conf = new SparkConf().setAppName("Music Graph Application").setMaster("local[1]")
  val sc = new SparkContext(conf)

  val WriterWriter: RDD[(VertexId, String)] = sc.textFile(vertexWriterWriter).map { line =>
    val row = line.split(",")
    (row(0).toLong, row(2))
  }

  val edgesWriterWriterCollaborated: RDD[Edge[EdgeProperties]] = sc.textFile(edgeWeightedWriterWriterCollaborated).map { line =>
    val row = line.split(",")
    Edge(row(0).toLong, row(1).toLong, WriterWriterProperties(row(2), row(3)): EdgeProperties)
  }

  val graph4 = Graph(WriterWriter, edgesWriterWriterCollaborated)
}
Am I declaring the class wrongly, using it incorrectly, or putting it in the wrong place? Thank you so much, as I am completely new to this.
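Not an answer to the declaration question as such, but java.lang.NoSuchMethodError: scala.Product.$init$ typically points to a Scala binary-version mismatch rather than to the case class itself: the code was compiled against one Scala major version (for example 2.12) while the Spark/GraphX jars on the classpath were built for another (for example 2.11). A hedged build.sbt sketch, with illustrative version numbers only:

// Keep scalaVersion's major version aligned with the Scala suffix of the Spark artifacts.
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"   % "2.4.8",
  "org.apache.spark" %% "spark-graphx" % "2.4.8"
)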

How can I cast a string to a date in Scala so I can filter on it in SparkSQL?

I want to be able to filter on a date just like you would in normal SQL. Is that possible? I'm running into an issue with how to convert the string from the text file into a date.
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.sql._
import org.apache.log4j._
import java.text._
//import java.util.Date
import java.sql.Date

object BayAreaBikeAnalysis {
  case class Station(ID: Int, name: String, lat: Double, longitude: Double, dockCount: Int, city: String, installationDate: Date)
  case class Status(station_id: Int, bikesAvailable: Int, docksAvailable: Int, time: String)

  val dateFormat = new SimpleDateFormat("yyyy-MM-dd")

  def extractStations(line: String): Station = {
    val fields = line.split(",", -1)
    val station: Station = Station(fields(0).toInt, fields(1), fields(2).toDouble, fields(3).toDouble, fields(4).toInt, fields(5), dateFormat.parse(fields(6)))
    return station
  }

  def extractStatus(line: String): Status = {
    val fields = line.split(",", -1)
    val status: Status = Status(fields(0).toInt, fields(1).toInt, fields(2).toInt, fields(3))
    return status
  }

  def main(args: Array[String]) {
    // Set the log level to only print errors
    //Logger.getLogger("org").setLevel(Level.ERROR)

    // Use new SparkSession interface in Spark 2.0
    val spark = SparkSession
      .builder
      .appName("BayAreaBikeAnalysis")
      .master("local[*]")
      .config("spark.sql.warehouse.dir", "file:///C:/temp")
      .getOrCreate()

    //Load files into data sets
    import spark.implicits._
    val stationLines = spark.sparkContext.textFile("Data/station.csv")
    val stations = stationLines.map(extractStations).toDS().cache()
    val statusLines = spark.sparkContext.textFile("Data/status.csv")
    val statuses = statusLines.map(extractStatus).toDS().cache()

    //people.select("name").show()
    stations.select("installationDate").show()

    spark.stop()
  }
}
Obviously fields(6).toDate() doesn't compile but I'm not sure what to use.
I think this post is what you are looking for.
Also, here you'll find a good tutorial on parsing a string to a date.
Hope this helps!
Following are the ways you can convert a string to a date in Scala.
(1) In case of java.util.Date:
val date = new SimpleDateFormat("yyyy-MM-dd")
date.parse("2017-09-28")
(2) In case of Joda-Time's DateTime:
DateTime.parse("09-28-2017")
Here is a helper function that takes a string representing a date and transforms it into a Timestamp:
import java.sql.Timestamp
import java.util.TimeZone
import java.text.{DateFormat, SimpleDateFormat}

def getTimeStamp(timeStr: String): Timestamp = {
  val dateFormat: DateFormat = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss")
  dateFormat.setTimeZone(TimeZone.getTimeZone("UTC"))
  val date: Option[Timestamp] = {
    try {
      Some(new Timestamp(dateFormat.parse(timeStr).getTime))
    } catch {
      // Fall back to the Unix epoch if the string cannot be parsed
      // (Timestamp.valueOf expects the "yyyy-mm-dd hh:mm:ss" format).
      case _: Exception => Some(Timestamp.valueOf("1970-01-01 00:00:00"))
    }
  }
  date.getOrElse(Timestamp.valueOf(timeStr))
}
Obviously, you will need to change the input date format from "yyyy-MM-dd'T'HH:mm:ss" to whatever format your date string is in.
Hope this helps.
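To tie this back to the original question: once installationDate is a proper date/timestamp column, filtering reads like ordinary SQL. A minimal sketch, assuming spark.implicits._ is in scope as in the question and using an illustrative cutoff date:

import org.apache.spark.sql.functions.lit

// Keep only stations installed after the cutoff; the comparison happens on date-typed values.
stations
  .filter($"installationDate" > lit("2014-01-01").cast("date"))
  .show()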

Scala: Product with Serializable does not take parameters

My objective is to read data from a CSV file and convert my RDD to a DataFrame in Scala/Spark. This is my code:
package xxx.DataScience.CompensationStudy

import org.apache.spark._
import org.apache.log4j._
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types._
import org.apache.spark.sql.SQLContext

object CompensationAnalysis {
  case class GetDF(profil_date: String, profil_pays: String, param_tarif2: String, param_tarif3: String, dt_titre: String, dt_langues: String,
                   dt_diplomes: String, dt_experience: String, dt_formation: String, dt_outils: String, comp_applications: String,
                   comp_interventions: String, comp_competence: String)

  def main(args: Array[String]) {
    Logger.getLogger("org").setLevel(Level.ERROR)

    val conf = new SparkConf().setAppName("CompensationAnalysis ")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._

    val lines = sc.textFile("C:/Users/../Downloads/CompensationStudy.csv").flatMap { l =>
      l.split(",") match {
        case field: Array[String] if field.size > 13 => Some(field(0), field(1), field(2), field(3), field(4), field(5), field(6), field(7), field(8), field(9), field(10), field(11), field(12))
        case field: Array[String] if field.size == 1 => Some((field(0), "default value"))
        case _ => None
      }
    }
At this stage, I get the error: Product with Serializable does not take parameters
    val summary = lines.collect().map(x => GetDF(x("profil_date"), x("profil_pays"), x("param_tarif2"), x("param_tarif3"), x("dt_titre"), x("dt_langues"), x("dt_diplomes"), x("dt_experience"), x("dt_formation"), x("dt_outils"), x("comp_applications"), x("comp_interventions"), x("comp_competence")))

    val sum_df = summary.toDF()

    sum_df.printSchema
  }
}
Help please?
You have several things you should improve. The most urgent problem, which causes the exception, is, as @CyrilleCorpet points out, "the three different lines in the pattern matching return values of types Some[Tuple13], Some[Tuple2], and None.type. The least upper bound is then Option[Product with Serializable], which complies with flatMap's signature (where the result should be an Iterable[T]) modulo some implicit conversion."
Basically, if you had Some[Tuple13], Some[Tuple13], and None or Some[Tuple2], Some[Tuple2], and None, you would be better off.
Also, pattern matching on types is generally a bad idea because of type erasure, and pattern matching isn't even great anyway for your situation.
So you could set default values in your case class:
case class GetDF(profile_date: String,
                 profile_pays: String = "default",
                 param_tarif2: String = "default",
                 ...
                )
Then in your lambda:
val tokens = l.split(",")
if (tokens.length > 13) {
  Some(GetDF(tokens(0), tokens(1), tokens(2), ...))
} else if (tokens.length == 1) {
  Some(GetDF(tokens(0)))
} else {
  None
}
Now in all cases you are returning Option[GetDF]. You can flatMap the RDD to get rid of all the Nones and keep only GetDF instances.
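Putting it together, a minimal sketch of the revised pipeline, assuming every field after the first has a default value as sketched above and sqlContext.implicits._ is in scope as in the question:

val parsed = sc.textFile("C:/Users/../Downloads/CompensationStudy.csv").flatMap { l =>
  val tokens = l.split(",")
  if (tokens.length > 13)
    Some(GetDF(tokens(0), tokens(1), tokens(2), tokens(3), tokens(4), tokens(5), tokens(6),
               tokens(7), tokens(8), tokens(9), tokens(10), tokens(11), tokens(12)))
  else if (tokens.length == 1)
    Some(GetDF(tokens(0)))
  else
    None
}

// flatMap over Option already drops the Nones, so parsed is an RDD[GetDF].
val sum_df = parsed.toDF()
sum_df.printSchema()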