Read CSV with more than 22 columns in Apache Flink - Scala

What I've been doing until now is reading the CSV as follows:
val data = env.readCsvFile[ElecNormNew](getClass.getResource("/elecNormNew.arff").getPath)
val dataSet = data map { tuple =>
  val list = tuple.productIterator.toList
  val numList = list map (_.asInstanceOf[Double])
  LabeledVector(numList(8), DenseVector(numList.take(8).toArray))
}
Where ElecNormNew is a case class:
case class ElecNormNew(
  var date: Double,
  var day: Double,
  var period: Double,
  var nswprice: Double,
  var nswdemand: Double,
  var vicprice: Double,
  var vicdemand: Double,
  var transfer: Double,
  var label: Double) extends Serializable
All of this is as specified in Flink's docs. But now I am trying to read a CSV with 53 columns. Is there a way to automate this process, or do I need to create a POJO with 53 fields?
Update
After Fabian's answer, I am trying this:
val fieldTypes: Array[TypeInformation[_]] = Array(Types.STRING, Types.INT)
val rowIF = new RowCsvInputFormat(new Path(getClass.getResource("/lungcancer.csv").getPath), fieldTypes)
val csvData: DataSet[Row] = env.createInput[Row](rowIF)
val dataSet2 = csvData.map { tuple =>
  ???
}
But I don't know how to continue. How am I supposed to use the RowTypeInfo?

You can use a RowCsvInputFormat as follows:
val fieldTypes: Array[TypeInformation[_]] = Array(Types.STRING, Types.INT)
val rowIF = new RowCsvInputFormat(new Path("file:///myCsv"), fieldTypes)
val csvData: DataSet[Row] = env.createInput[Row](rowIF)
Row stores the data in an Array[Any]. Therefore, Flink cannot automatically infer the field types of a Row, which makes it a bit harder to use than typed tuples or case classes. You need to explicitly provide the RowTypeInfo with the correct types. This can be done as an implicit value or with a function that extends the ResultTypeQueryable interface.
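For example, the implicit-value route could look roughly as follows for the 53-column case. This is only a sketch: it assumes a reasonably recent Flink version, that all 53 columns are Doubles, and that the label sits in the last column, so adjust fieldTypes and the indices to the real schema.
import org.apache.flink.api.scala._
import org.apache.flink.api.common.typeinfo.{TypeInformation, Types}
import org.apache.flink.api.java.io.RowCsvInputFormat
import org.apache.flink.api.java.typeutils.RowTypeInfo
import org.apache.flink.core.fs.Path
import org.apache.flink.types.Row
import org.apache.flink.ml.common.LabeledVector
import org.apache.flink.ml.math.DenseVector
// 53 Double columns; adjust to the real schema
val fieldTypes: Array[TypeInformation[_]] = Array.fill[TypeInformation[_]](53)(Types.DOUBLE)
// the implicit RowTypeInfo tells the Scala API the field types of the Rows read below
implicit val rowTypeInfo: TypeInformation[Row] = new RowTypeInfo(fieldTypes: _*)
val rowIF = new RowCsvInputFormat(new Path("file:///myCsv"), fieldTypes)
val csvData: DataSet[Row] = env.createInput(rowIF)
// Row.getField returns Any, so every field has to be cast explicitly
val dataSet: DataSet[LabeledVector] = csvData.map { row =>
  val values = (0 until row.getArity).map(i => row.getField(i).asInstanceOf[Double])
  LabeledVector(values.last, DenseVector(values.dropRight(1).toArray))
}
Since the mapping works on Row indices instead of case-class fields, only the fieldTypes array has to change when the column count changes.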

Related

ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast

I am using an Aggregator to apply some custom merge on a DataFrame after grouping its records by their primary key:
case class Player(
  pk: String,
  ts: String,
  first_name: String,
  date_of_birth: String
)
case class PlayerProcessed(
  var ts: String,
  var first_name: String,
  var date_of_birth: String
)
// Custom Aggregator - this is just for the example, the actual one is more complex
object BatchDedupe extends Aggregator[Player, PlayerProcessed, PlayerProcessed] {
  def zero: PlayerProcessed = PlayerProcessed("0", null, null)
  def reduce(bf: PlayerProcessed, in: Player): PlayerProcessed = {
    bf.ts = in.ts
    bf.first_name = in.first_name
    bf.date_of_birth = in.date_of_birth
    bf
  }
  def merge(bf1: PlayerProcessed, bf2: PlayerProcessed): PlayerProcessed = {
    bf1.ts = bf2.ts
    bf1.first_name = bf2.first_name
    bf1.date_of_birth = bf2.date_of_birth
    bf1
  }
  def finish(reduction: PlayerProcessed): PlayerProcessed = reduction
  def bufferEncoder: Encoder[PlayerProcessed] = Encoders.product
  def outputEncoder: Encoder[PlayerProcessed] = Encoders.product
}
val ply1 = Player("12121212121212", "10000001", "Rogger", "1980-01-02")
val ply2 = Player("12121212121212", "10000002", "Rogg", null)
val ply3 = Player("12121212121212", "10000004", null, "1985-01-02")
val ply4 = Player("12121212121212", "10000003", "Roggelio", "1982-01-02")
val seq_users = sc.parallelize(Seq(ply1, ply2, ply3, ply4)).toDF.as[Player]
val grouped = seq_users.groupByKey(_.pk)
val non_sorted = grouped.agg(BatchDedupe.toColumn.name("deduped"))
non_sorted.show(false)
This returns:
+--------------+--------------------------------+
|key |deduped |
+--------------+--------------------------------+
|12121212121212|{10000003, Roggelio, 1982-01-02}|
+--------------+--------------------------------+
Now, I would like to order the records based on ts before aggregating them. From here I understand that .sortBy("ts") does not guarantee the order after the .groupByKey(_.pk), so I was trying to apply the .sortBy between the .groupByKey and the .agg.
The output of .groupByKey(_.pk) is a KeyValueGroupedDataset[String, Player], where the second element is an Iterator. So, to apply some sorting logic there, I convert it into a Seq:
val sorted = grouped.mapGroups{case(k, iter) => (k, iter.toSeq.sortBy(_.ts))}.agg(BatchDedupe.toColumn.name("deduped"))
sorted.show(false)
However, the output of .mapGroups after adding the sorting logic is a Dataset[(String, Seq[Player])]. So when I try to invoke the .agg function on it I am getting the following exception:
Caused by: ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to $line050e0d37885948cd91f7f7dd9e3b4da9311.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$Player
How could I convert the output of my .mapGroups(...) back into a KeyValueGroupedDataset[String, Player]?
I tried converting back to an Iterator as follows:
val sorted = grouped.mapGroups{case(k, iter) => (k, iter.toSeq.sortBy(_.ts).toIterator)}.agg(BatchDedupe.toColumn.name("deduped"))
But this approach produced the following exception:
UnsupportedOperationException: No Encoder found for Iterator[Player]
- field (class: "scala.collection.Iterator", name: "_2")
- root class: "scala.Tuple2"
How else can I add the sort logic between the .groupByKey and .agg methods?
Based on the discussion above, the purpose of the Aggregator is to get the latest non-null field values per Player, ordered by ts.
This can be achieved fairly easily by aggregating all fields individually with max_by. With that, there is no need for a custom Aggregator or a mutable aggregation buffer.
import org.apache.spark.sql.functions._
val players: Dataset[Player] = ...
// aggregate all columns except the key individually by ts
// NULLs will be ignored (SQL standard)
val aggColumns = players.columns
.filterNot(_ == "pk")
.map(colName => expr(s"max_by($colName, if(isNotNull($colName), ts, null))").as(colName))
val aggregatedPlayers = players
.groupBy(col("pk"))
.agg(aggColumns.head, aggColumns.tail: _*)
.as[Player]
On more recent versions of Spark you can also call the built-in max_by function from org.apache.spark.sql.functions directly instead of going through expr:
import org.apache.spark.sql.functions._
val players: Dataset[Player] = ...
// aggregate all columns except the key individually by ts
// NULLs will be ignored (SQL standard)
val aggColumns = players.columns
.filterNot(_ == "pk")
.map(colName => max_by(col(colName), when(col(colName).isNotNull, col("ts"))).as(colName))
val aggregatedPlayers = players
.groupBy(col("pk"))
.agg(aggColumns.head, aggColumns.tail: _*)
.as[Player]

Java.lang.IllegalArgumentException: requirement failed: Columns not found in Double

I am working in Spark. I have many CSV files that contain lines; a line looks like this:
2017,16,16,51,1,1,4,-79.6,-101.90,-98.900
A line can contain more or fewer fields, depending on the CSV file.
Each file corresponds to a Cassandra table, where I need to insert all the lines the file contains. So what I basically do is take the line, split its elements, and put them in a List[Double]:
sc.stop
import com.datastax.spark.connector._, org.apache.spark.SparkContext, org.apache.spark.SparkContext._, org.apache.spark.SparkConf
val conf = new SparkConf(true).set("spark.cassandra.connection.host", "localhost")
val sc = new SparkContext(conf)
val nameTable = "artport"
val ligne = "20171,16,165481,51,1,1,4,-79.6000,-101.7000,-98.9000"
val linetoinsert : List[String] = ligne.split(",").toList
var ainserer : Array[Double] = new Array[Double](linetoinsert.length)
for (l <- 0 until linetoinsert.length) yield { ainserer(l) = linetoinsert(l).toDouble }
val liste = ainserer.toList
val rdd = sc.parallelize(liste)
rdd.saveToCassandra("db", nameTable) //db is the name of my keyspace in cassandra
When I run my code I get this error
java.lang.IllegalArgumentException: requirement failed: Columns not found in Double: [collecttime, sbnid, enodebid, rackid, shelfid, slotid, channelid, c373910000, c373910001, c373910002]
at scala.Predef$.require(Predef.scala:224)
at com.datastax.spark.connector.mapper.DefaultColumnMapper.columnMapForWriting(DefaultColumnMapper.scala:108)
at com.datastax.spark.connector.writer.MappedToGettableDataConverter$$anon$1.<init>(MappedToGettableDataConverter.scala:37)
at com.datastax.spark.connector.writer.MappedToGettableDataConverter$.apply(MappedToGettableDataConverter.scala:28)
at com.datastax.spark.connector.writer.DefaultRowWriter.<init>(DefaultRowWriter.scala:17)
at com.datastax.spark.connector.writer.DefaultRowWriter$$anon$1.rowWriter(DefaultRowWriter.scala:31)
at com.datastax.spark.connector.writer.DefaultRowWriter$$anon$1.rowWriter(DefaultRowWriter.scala:29)
at com.datastax.spark.connector.writer.TableWriter$.apply(TableWriter.scala:382)
at com.datastax.spark.connector.RDDFunctions.saveToCassandra(RDDFunctions.scala:35)
... 60 elided
I figured out that the insertion works if my RDD is of type:
rdd: org.apache.spark.rdd.RDD[(Double, Double, Double, Double, Double, Double, Double, Double, Double, Double)]
But the one I get from what I am doing is org.apache.spark.rdd.RDD[Double].
I can't use a Scala Tuple9, for example, because I don't know the number of elements my list will contain before execution. That solution also doesn't fit my problem because I sometimes have more than 100 columns in my CSV, and tuples stop at Tuple22.
Thanks for your help
As @SergGr mentioned, a Cassandra table has a schema with known columns, so you need to map your array to the Cassandra schema before saving to the database. You can use a case class for this. Try the following code; I assume each column in the Cassandra table is of type Double.
//create a case class equivalent to your Cassandra table
case class Schema(
  collecttime: Double,
  sbnid: Double,
  enodebid: Double,
  rackid: Double,
  shelfid: Double,
  slotid: Double,
  channelid: Double,
  c373910000: Double,
  c373910001: Double,
  c373910002: Double)
object test {
  import com.datastax.spark.connector._, org.apache.spark.SparkContext, org.apache.spark.SparkContext._, org.apache.spark.SparkConf
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(true).set("spark.cassandra.connection.host", "localhost")
    val sc = new SparkContext(conf)
    val nameTable = "artport"
    val ligne = "20171,16,165481,51,1,1,4,-79.6000,-101.7000,-98.9000"
    //parse the ligne string into the Schema case class
    val schema = parseString(ligne)
    //get an RDD[Schema]
    val rdd = sc.parallelize(Seq(schema))
    //now you can save this RDD to Cassandra
    rdd.saveToCassandra("db", nameTable)
  }
  //function to parse a string into the Schema case class
  def parseString(s: String): Schema = {
    //get each field from the string array
    val Array(collecttime, sbnid, enodebid, rackid, shelfid, slotid,
      channelid, c373910000, c373910001, c373910002, _*) = s.split(",").map(_.toDouble)
    //map those fields to the Schema class
    Schema(collecttime, sbnid, enodebid, rackid, shelfid, slotid,
      channelid, c373910000, c373910001, c373910002)
  }
}

how to convert RDD[(String, Any)] to Array(Row)?

I've got an unstructured RDD with keys and values. The values are of type RDD[Any] and the keys are currently Strings (RDD[String]); the values mainly contain Maps. I would like to make them of type Row so I can eventually make a DataFrame. Here is my RDD:
removed
Most of the RDD follows a pattern except for the last 4 keys. How should this be dealt with? Perhaps split them into their own RDD, especially for reverseDeltas?
Thanks
Edit
This is what I've tried so far, based on the first answer below.
case class MyData(`type`: List[String], libVersion: Double, id: BigInt)
object MyDataBuilder {
  def apply(s: Any): MyData = {
    // read the input data and convert that to the case class
    s match {
      case Array(x: List[String], y: Double, z: BigInt) => MyData(x, y, z)
      case Array(a: BigInt, Array(x: List[String], y: Double, z: BigInt)) => MyData(x, y, z)
      case _ => null
    }
  }
}
val parsedRdd: RDD[MyData] = rdd.map(x => MyDataBuilder(x))
However, it doesn't seem to match any of those cases. How can I match on a Map in Scala? I keep getting nulls back when printing out parsedRdd.
To convert the RDD to a DataFrame you need to have a fixed schema. If you define the schema for the RDD, the rest is simple.
Something like:
val rdd2:RDD[Array[String]] = rdd.map( x => getParsedRow(x))
val rddFinal:RDD[Row] = rdd2.map(x => Row.fromSeq(x))
Alternatively:
case class MyData(....) // all the fields of the Schema I want
object MyDataBuilder {
  def apply(s: Any): MyData = {
    // read the input data and convert that to the case class
  }
}
val rddFinal:RDD[MyData] = rdd.map(x => MyDataBuilder(x))
import spark.implicits._
val myDF = rddFinal.toDF
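Regarding the follow-up about matching on a Map: the builder's apply could pattern match on the value type first. The sketch below is only illustrative; the keys "type", "libVersion" and "id" are assumptions mirroring the MyData fields above, and because of erasure the Map's type parameters cannot be checked at runtime.
object MyDataBuilder {
  def apply(s: Any): MyData = s match {
    // the value is a Map: pull out the expected keys and cast them
    case m: Map[String, Any] @unchecked =>
      MyData(
        m.get("type").map(_.asInstanceOf[List[String]]).getOrElse(Nil),
        m.get("libVersion").map(_.asInstanceOf[Double]).getOrElse(0.0),
        m.get("id").map(_.asInstanceOf[BigInt]).getOrElse(BigInt(0)))
    // anything else falls through as before
    case _ => null
  }
}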
There is a method for converting an RDD to a DataFrame; use it like below:
val rdd = sc.textFile("/pathtologfile/logfile.txt")
val df = rdd.toDF()
Now you have a DataFrame; do whatever you want with it using SQL-style queries, like below:
val textFile = sc.textFile("hdfs://...")
// Creates a DataFrame having a single column named "line"
val df = textFile.toDF("line")
val errors = df.filter(col("line").like("%ERROR%"))
// Counts all the errors
errors.count()
// Counts errors mentioning MySQL
errors.filter(col("line").like("%MySQL%")).count()
// Fetches the MySQL errors as an array of strings
errors.filter(col("line").like("%MySQL%")).collect()

Spark SQL: How to convert List[List[Any]] to DataFrame

val list = List(List(1,"Ankita"),List(2,"Kunal"))
and now I want to convert it into a DataFrame -
val list = List(List(1,"Ankita"),List(2,"Kunal")).toDF("id","name")
but it throws an error -
java.lang.ClassNotFoundException: Scala.any
AFAIK, List[List[Any]] cannot be converted to a DataFrame directly; it needs to be converted to a list of some concrete type first (here I take Person as an example), i.e. List[Person]:
case class Person(id: Int, name: String)
val list = List(List(1,"Ankita"),List(2,"Kunal"))
val listDf = list.map(x => Person(x(0).asInstanceOf[Int], x(1).toString)).toDF("id","name")
Another way, per the comment of user8371915, is to create a list of pairs and convert that to a DataFrame:
val listDf = list.map {
  case List(id: Int, name: String) => (id, name)
}.toDF("id", "name")
Because the inner List can be of arbitrary size, the implicit conversion to a DataFrame cannot be used. It can be converted if you change it to a List of tuples:
val list = List((1,"Ankita"),(2,"Kunal")).toDF("id","name")
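If the data really has to stay as List[List[Any]], one more option (not from the answers above; just a sketch that assumes a SparkSession named spark and the two-column id/name layout) is to build Rows and supply an explicit schema:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
val list = List(List(1, "Ankita"), List(2, "Kunal"))
// the columns have to be described explicitly, since they cannot be inferred from Any
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = false)))
// turn each inner List into a Row and build the DataFrame from an RDD[Row]
val rows = list.map(values => Row(values: _*))
val listDf = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)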

Create an RDD: too many fields => use case class for RDD

I have a labeled intrusion-detection dataset that I want to use to test different supervised machine learning techniques.
So here is a part of my code:
object parser_dataset {
  val conf = new SparkConf()
    .setMaster("local[2]")
    .setAppName("kdd")
    .set("spark.executor.memory", "8g")
  conf.registerKryoClasses(Array(
    classOf[Array[Any]],
    classOf[Array[scala.Tuple3[Int, Int, Int]]],
    classOf[String],
    classOf[Any]
  ))
  val context = new SparkContext(conf)
  def load(file: String): RDD[(Int, String, String, String, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Double, Double, Double, Double, Double, Double, Double, Int, Int, Double, Double, Double, Double, Double, Double, Double, Double, String)] = {
    val data = context.textFile(file)
    val res = data.map(x => {
      val s = x.split(",")
      (s(0).toInt, s(1), s(2), s(3), s(4).toInt, s(5).toInt, s(6).toInt, s(7).toInt, s(8).toInt, s(9).toInt, s(10).toInt, s(11).toInt, s(12).toInt, s(13).toInt, s(14).toInt, s(15).toInt, s(16).toInt, s(17).toInt, s(18).toInt, s(19).toInt, s(20).toInt, s(21).toInt, s(22).toInt, s(23).toInt, s(24).toDouble, s(25).toDouble, s(26).toDouble, s(27).toDouble, s(28).toDouble, s(29).toDouble, s(30).toDouble, s(31).toInt, s(32).toInt, s(33).toDouble, s(34).toDouble, s(35).toDouble, s(36).toDouble, s(37).toDouble, s(38).toDouble, s(39).toDouble, s(40).toDouble, s(41))
    }).persist(StorageLevel.MEMORY_AND_DISK)
    return res
  }
  def main(args: Array[String]) {
    val data = this.load("/home/hvfd8529/Datasets/KDDCup99/kddcup.data_10_percent_corrected")
    data.collect.foreach(println)
    data.distinct()
  }
}
This is not my code; it was given to me and I just modified some parts (especially the RDD and splitting parts). I'm a newbie at Scala and Spark :)
EDIT:
So I added case classes above my load function, like this:
case class BasicFeatures(duration:Int, protocol_type:String, service:String, flag:String, src_bytes:Int, dst_bytes:Int, land:Int, wrong_fragment:Int, urgent:Int)
case class ContentFeatures(hot:Int, num_failed_logins:Int, logged_in:Int, num_compromised:Int, root_shell:Int, su_attempted:Int, num_root:Int, num_file_creations:Int, num_shells:Int, num_access_files:Int, num_outbound_cmds:Int, is_host_login:Int, is_guest_login:Int)
case class TrafficFeatures(count:Int, srv_count:Int, serror_rate:Double, srv_error_rate:Double, rerror_rate:Double, srv_rerror_rate:Double, same_srv_rate:Double, diff_srv_rate:Double, srv_diff_host_rate:Double, dst_host_count:Int, dst_host_srv_count:Int, dst_host_same_srv_rate:Double, dst_host_diff_srv_rate:Double, dst_host_same_src_port_rate:Double, dst_host_srv_diff_host_rate:Double, dst_host_serror_rate:Double, dst_host_srv_serror_rate:Double, dst_host_rerror_rate:Double, dst_host_srv_rerror_rate:Double, attack_type:String )
But now I am confused: how can I use these to solve my problem? I still need an RDD in order to have one feature = one field.
Here is one line of the file I want to parse:
0,tcp,ftp_data,SF,491,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,150,25,0.17,0.03,0.17,0.00,0.00,0.00,0.05,0.00,normal,20
The maximum tuple size supported by Scala is 22 (Scala functions are limited to 22 parameters), so you cannot create a tuple with more than 22 elements. Splitting the record across several case classes, as in your edit, is the usual way around this; see the sketch below.
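For instance, a sketch of how the three case classes from your edit could be combined into a single record. The Connection wrapper and the parse helper are assumptions, not part of your code; the field order follows your original load function, so a trailing field like the final 20 in the sample line is simply ignored.
// hypothetical wrapper around the three case classes from the edit
case class Connection(basic: BasicFeatures, content: ContentFeatures, traffic: TrafficFeatures)
def parse(line: String): Connection = {
  val s = line.split(",")
  // fields 0-8: basic features
  val basic = BasicFeatures(s(0).toInt, s(1), s(2), s(3), s(4).toInt, s(5).toInt,
    s(6).toInt, s(7).toInt, s(8).toInt)
  // fields 9-21: content features
  val content = ContentFeatures(s(9).toInt, s(10).toInt, s(11).toInt, s(12).toInt,
    s(13).toInt, s(14).toInt, s(15).toInt, s(16).toInt, s(17).toInt, s(18).toInt,
    s(19).toInt, s(20).toInt, s(21).toInt)
  // fields 22-41: traffic features, including the attack label
  val traffic = TrafficFeatures(s(22).toInt, s(23).toInt, s(24).toDouble, s(25).toDouble,
    s(26).toDouble, s(27).toDouble, s(28).toDouble, s(29).toDouble, s(30).toDouble,
    s(31).toInt, s(32).toInt, s(33).toDouble, s(34).toDouble, s(35).toDouble,
    s(36).toDouble, s(37).toDouble, s(38).toDouble, s(39).toDouble, s(40).toDouble, s(41))
  Connection(basic, content, traffic)
}
def load(file: String): RDD[Connection] =
  context.textFile(file).map(parse)
Each feature is still an individual field (e.g. record.traffic.attack_type), so the "one feature = one field" requirement is kept while every case class stays under the 22-field limit.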