java.lang.IllegalArgumentException: requirement failed: Columns not found in Double - scala

I am working in Spark. I have many CSV files that contain lines; a line looks like this:
2017,16,16,51,1,1,4,-79.6,-101.90,-98.900
It can contain more or fewer fields, depending on the CSV file.
Each file corresponds to a Cassandra table, where I need to insert all the lines the file contains. So what I basically do is get the line, split its elements, and put them in a List[Double]:
sc.stop
import com.datastax.spark.connector._, org.apache.spark.SparkContext, org.apache.spark.SparkContext._, org.apache.spark.SparkConf
val conf = new SparkConf(true).set("spark.cassandra.connection.host", "localhost")
val sc = new SparkContext(conf)
val nameTable = "artport"
val ligne = "20171,16,165481,51,1,1,4,-79.6000,-101.7000,-98.9000"
val linetoinsert : List[String] = ligne.split(",").toList
var ainserer : Array[Double] = new Array[Double](linetoinsert.length)
for (l <- 0 until linetoinsert.length) yield { ainserer(l) = linetoinsert(l).toDouble }
val liste = ainserer.toList
val rdd = sc.parallelize(liste)
rdd.saveToCassandra("db", nameTable) //db is the name of my keyspace in cassandra
When I run my code I get this error
java.lang.IllegalArgumentException: requirement failed: Columns not found in Double: [collecttime, sbnid, enodebid, rackid, shelfid, slotid, channelid, c373910000, c373910001, c373910002]
at scala.Predef$.require(Predef.scala:224)
at com.datastax.spark.connector.mapper.DefaultColumnMapper.columnMapForWriting(DefaultColumnMapper.scala:108)
at com.datastax.spark.connector.writer.MappedToGettableDataConverter$$anon$1.<init>(MappedToGettableDataConverter.scala:37)
at com.datastax.spark.connector.writer.MappedToGettableDataConverter$.apply(MappedToGettableDataConverter.scala:28)
at com.datastax.spark.connector.writer.DefaultRowWriter.<init>(DefaultRowWriter.scala:17)
at com.datastax.spark.connector.writer.DefaultRowWriter$$anon$1.rowWriter(DefaultRowWriter.scala:31)
at com.datastax.spark.connector.writer.DefaultRowWriter$$anon$1.rowWriter(DefaultRowWriter.scala:29)
at com.datastax.spark.connector.writer.TableWriter$.apply(TableWriter.scala:382)
at com.datastax.spark.connector.RDDFunctions.saveToCassandra(RDDFunctions.scala:35)
... 60 elided
I figured out that the insertion works if my RDD is of type:
rdd: org.apache.spark.rdd.RDD[(Double, Double, Double, Double, Double, Double, Double, Double, Double, Double)]
But the one I get from what I am doing is of type org.apache.spark.rdd.RDD[Double].
I can't use a Scala Tuple9, for example, because I don't know the number of elements my list is going to contain before execution. That solution also doesn't fit my problem because sometimes I have more than 100 columns in my CSV, and tuples stop at Tuple22.
Thanks for your help

As @SergGr mentioned, a Cassandra table has a schema with known columns. So you need to map your Array to the Cassandra schema before saving to the Cassandra database. You can use a case class for this. Try the following code; I assume each column in the Cassandra table is of type Double.
//create a case class equivalent to your Cassandra table
case class Schema(collecttime: Double,
                  sbnid: Double,
                  enodebid: Double,
                  rackid: Double,
                  shelfid: Double,
                  slotid: Double,
                  channelid: Double,
                  c373910000: Double,
                  c373910001: Double,
                  c373910002: Double)

object test {
  import com.datastax.spark.connector._, org.apache.spark.SparkContext, org.apache.spark.SparkContext._, org.apache.spark.SparkConf

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(true).set("spark.cassandra.connection.host", "localhost")
    val sc = new SparkContext(conf)
    val nameTable = "artport"
    val ligne = "20171,16,165481,51,1,1,4,-79.6000,-101.7000,-98.9000"

    //parse the ligne string into the Schema case class
    val schema = parseString(ligne)

    //get RDD[Schema]
    val rdd = sc.parallelize(Seq(schema))

    //now you can save this RDD to cassandra
    rdd.saveToCassandra("db", nameTable)
  }

  //function to parse a string into the Schema case class
  def parseString(s: String): Schema = {
    //get each field from the string array
    val Array(collecttime, sbnid, enodebid, rackid, shelfid, slotid,
      channelid, c373910000, c373910001, c373910002, _*) = s.split(",").map(_.toDouble)

    //map those fields to the Schema class
    Schema(collecttime,
      sbnid,
      enodebid,
      rackid,
      shelfid,
      slotid,
      channelid,
      c373910000,
      c373910001,
      c373910002)
  }
}
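Since the number of columns varies per file and can exceed Tuple22, one possible alternative is to skip the object mapper entirely and issue the INSERT yourself through the connector's CassandraConnector. This is only a sketch: the column names below are placeholders, and for real data you would batch the statements per partition rather than per line.

import com.datastax.spark.connector.cql.CassandraConnector

// placeholders: replace with the actual column names of the target table,
// in the same order as the values parsed from the CSV line
val columnNames = Seq("collecttime", "sbnid", "enodebid")
val values: List[Double] = List(20171, 16, 165481)

val connector = CassandraConnector(sc.getConf)
val cql =
  s"INSERT INTO db.$nameTable (${columnNames.mkString(", ")}) " +
  s"VALUES (${values.mkString(", ")})"

// one statement per line; for many lines, run this inside rdd.foreachPartition
// so the session is reused per partition
connector.withSessionDo(session => session.execute(cql))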

Related

Transforming RDD[String] to RDD[myclass]

I am trying to transform RDD[String] to RDD[Picture] but could not do it. If I could manage to convert the RDD to RDD[Picture], I would use def hasValidCountry to check whether the latitude and longitude values of the picture metadata are valid. After that I would check whether the user tags are valid with def hasTags in the Picture class. The problems I encounter:
Implicit conversion found: row ⇒ augmentString(row): scala.collection.immutable.StringOps
type mismatch; found : String required: Array[String]
value InterestingPics is not a member of Array[Nothing] possible cause: maybe a semicolon is missing before `value InterestingPics'?
My intention is to choose the lines which have a valid country and tags and transform all those lines into the new RDD[Picture] class.
ScalaFile1 (I have updated the ScalaFile):
object Part2 {
  def main(args: Array[String]): Unit = {
    var spark: SparkSession = null
    try {
      spark = SparkSession.builder().appName("Flickr using dataframes").config("spark.master", "local[*]").getOrCreate()
      val originalFlickrMeta: RDD[String] = spark.sparkContext.textFile("flickrSample.txt")
      val InterestingPics = originalFlickrMeta.map(row => row.split('\t')).map(field => Picture(field(0).toString())
      InterestingPics.collect
      InterestingPics.take(5).foreach(println)
This works, as an example:
case class case_for_rdd(c1: Int, c2: String, c3: String)
val rdd_data = spark.sparkContext.textFile("/FileStore/tables/csv01-4.txt")
val rdd = rdd_data.map(row => row.split(',')).map(field => case_for_rdd(field(0).toInt, field(1), field(2)))
rdd.collect
A more complicated example, reading into an RDD from a file that contains an array; the array needs a delimiter.
1,10,100,aa|bb|cc
2,20,200,xxxxxx|yyyyyyyy|z|aaa
Some sample code; use a List, as otherwise you get to see array addresses (that's what those strange strings are), courtesy of smarter people here:
case class case_for_rdd(c1: Int, c2: String, c3: String, a4: List[String])
val rdd_data = spark.sparkContext.textFile("/FileStore/tables/csv03.txt")
val myCaseRdd = rdd_data.map(row => row.split(',')).map(field => case_for_rdd(field(0).toInt, field(1), field(2), (field(3).split("\\|").toList)))
myCaseRdd.collect
My advice is to use a DataFrame; the splitting logic is then easier. Also, if you manipulate the RDD via further transformations, the case class typing is lost. An array column with the DataFrame API has no such issue.
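For illustration, a rough sketch of that DataFrame route, reusing the file and column names from the RDD example above (split is the built-in Spark SQL function):

import org.apache.spark.sql.functions.split
import spark.implicits._

// let Spark split the CSV columns, name them, then split the array column on "|"
val df = spark.read.option("inferSchema", "true").csv("/FileStore/tables/csv03.txt")
  .toDF("c1", "c2", "c3", "a4raw")
val dfWithArray = df
  .withColumn("a4", split($"a4raw", "\\|"))
  .drop("a4raw")
dfWithArray.show(false)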
I have a solution to my question thanks to the help of @thebluephantom. Thank you very much.
val InterestingPics = originalFlickrMeta.map(line => (new Picture(line.split("\t")))).filter(f => f.c != null && f.userTags.length > 0)
InterestingPics.collect().foreach(println)

KMeansModel.clusterCenters returns NULL

I am using AWS Glue to execute KMeans clustering on my dataset. I wish to find not only the cluster labels but also the cluster centers; I am failing to find the latter.
In the code below, model.clusterCenters returns NULL. The KMeans clustering works fine, and it returns the cluster label, i.e. the clusterInstance variable.
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}
object Clustering {
case class ObjectDay(realnumber: Double, bnumber : Double, blockednumber: Double,
creationdate : String, fname : String, uniqueid : Long, registrationdate : String,
plusnumber : Double, cvalue : Double, hvalue : Double)
case class ClusterInfo( instance: Int, centers: String)
def main(args: Array[String]): Unit = {
val sc: SparkContext = new SparkContext()
val glueContext: GlueContext = new GlueContext(sc)
val spark: SparkSession = glueContext.getSparkSession
import spark.implicits._
// write your code here - start
// Data Catalog: database and table name
val dbName = "dbname"
val tblName = "raw"
val sqlText = "SELECT <columns removed> FROM viewname WHERE `creation_date` ="
// S3 location for output
val outputDir = "s3://blucket/path/"
// Read data into a DynamicFrame using the Data Catalog metadata
val rawDyf: DynamicFrame = glueContext.getCatalogSource(database = dbName, tableName = tblName).getDynamicFrame()
// get only single day data with only numbers
// Spark SQL on a Spark dataframe
val numberDf = rawDyf.toDF()
numberDf.createOrReplaceTempView("viewname")
def getDataViaSql(runDate : LocalDate): RDD[ObjectDay] ={
val data = spark.sql(s"${sqlText} '${runDate.toString}'")
data.as[ObjectDay].rdd
}
def getDenseVector(rddnumbers: RDD[ObjectDay]): RDD[linalg.Vector]={
rddnumbers.map(s => Vectors.dense(Array(s.realnumber, s.bnumber, s.blockednumber))).cache()
}
def getClusters( numbers: RDD[linalg.Vector] ): RDD[ClusterInfo] = {
// Trains a k-means model
val model: KMeansModel = KMeans.train(numbers, 2, 20)
val centers: Array[linalg.Vector] = model.clusterCenters
//put together unique_ids with cluster predictions
val clusters: RDD[Int] = model.predict(numbers)
clusters.map{ clusterInstance =>
ClusterInfo(clusterInstance.toInt, centers(clusterInstance).toJson)
}
}
def combineDataAndClusterInstances(rddnumbers : RDD[ObjectDay], clusterCenters: RDD[ClusterInfo]): DataFrame ={
val numbersWithCluster = rddnumbers.zip(clusterCenters)
numbersWithCluster.map(
x =>
(x._1.realnumber, x._1.bnumber, x._1.blockednumber, x._1.creationdate, x._1.fname,
x._1.uniqueid, x._1.registrationdate, x._1.plusnumber, x._1.cvalue, x._1.hvalue,
x._2.instance, x._2.centers)
)
.toDF("realnumber", "bnumber", "blockednumber", "creationdate",
"fname","uniqueid", "registrationdate", "plusnumber", "cvalue", "hvalue",
"clusterInstance", "clusterCenter")
}
def process(runDate : LocalDate): DataFrame = {
val rddnumbers = getDataViaSql( runDate)
val dense = getDenseVector(rddnumbers)
val clusterCenters = getClusters(dense)
combineDataAndClusterInstances(rddnumbers, clusterCenters)
}
val startdt = LocalDate.parse("2018-01-01", DateTimeFormatter.ofPattern("yyyy-MM-dd"))
val dfByDates = (0 to 240)
.map(days => startdt.plusDays(days))
.map(process(_))
val result = dfByDates.tail.fold(dfByDates.head)((accDF, newDF) => accDF.union(newDF))
val output = DynamicFrame(result, glueContext).withName(name="prediction")
// write your code here - end
glueContext.getSinkWithFormat(connectionType = "s3",
options = JsonOptions(Map("path" -> outputDir)), format = "csv").writeDynamicFrame(output)
}
}
I can successfully find the cluster centres using the Python sklearn library on the same data.
UPDATED: showing the complete Scala code which runs as a Glue job. I am not getting any error while running the job; I just don't get any cluster centres.
What am I missing?
Never mind, it is generating cluster centres; I just didn't see the S3 output files until now.
I was running a Glue Crawler and looking at the results in AWS Athena.
The crawler created a struct or array column datatype for the clusterCenter column, and Athena failed to parse and read the JSON stored as a string in the CSV output.
Sorry to bother.
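For anyone hitting the same confusion, a quick way to sanity-check the centres without going through S3 and Athena is to print them on the driver right after training. This is just a sketch based on the getClusters function above, not part of the original job:

val model: KMeansModel = KMeans.train(numbers, 2, 20)
// clusterCenters is an Array[Vector]; print each centre into the Glue job's driver log
model.clusterCenters.zipWithIndex.foreach { case (center, i) =>
  println(s"cluster $i centre: ${center.toArray.mkString("[", ", ", "]")}")
}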

Read CSV with more than 22 colums in Apache Flink

What I've been doing until now is reading a CSV as follows:
val data = env.readCsvFile[ElecNormNew](getClass.getResource("/elecNormNew.arff").getPath)
val dataSet = data map { tuple =>
val list = tuple.productIterator.toList
val numList = list map (_.asInstanceOf[Double])
LabeledVector(numList(8), DenseVector(numList.take(8).toArray))
}
Where ElecNormNew is a case class:
case class ElecNormNew(
    var date: Double,
    var day: Double,
    var period: Double,
    var nswprice: Double,
    var nswdemand: Double,
    var vicprice: Double,
    var vicdemand: Double,
    var transfer: Double,
    var label: Double) extends Serializable {
}
As specified in Flink's docs. But now I am trying to read a CSV with 53 columns. Is there a way to automate this process? Do I need to create a POJO with 53 fields?
Update
After Fabian's answer, I am trying this:
val fieldTypes: Array[TypeInformation[_]] = Array(Types.STRING, Types.INT)
val rowIF = new RowCsvInputFormat(new Path(getClass.getResource("/lungcancer.csv").getPath), fieldTypes)
val csvData: DataSet[Row] = env.createInput[Row](rowIF)
val dataSet2 = csvData.map { tuple =>
???
}
But I do not know how to continue; how am I supposed to use RowTypeInfo?
You can use a RowCsvInputFormat as follows:
val fieldTypes: Array[TypeInformation[_]] = Array(Types.STRING, Types.INT)
val rowIF = new RowCsvInputFormat(new Path("file:///myCsv"), fieldTypes)
val csvData: DataSet[Row] = env.createInput[Row](rowIF)
Row stores the data in an Array[Any]. Therefore, Flink cannot automatically infer the field types of a Row. This makes it a bit harder to use than typed tuples or case classes. You need to explicitly provide a RowTypeInfo with the correct types. This can be done as implicit values or by functions that extend the ResultTypeQueryable interface.
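As an illustration of the second option, here is a minimal sketch of a map function that extends ResultTypeQueryable so Flink knows the schema of the Rows it produces. The class name DeclareRowType is made up, and it simply reuses the fieldTypes array from the snippet above:

import org.apache.flink.api.common.functions.MapFunction
import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.api.java.typeutils.{ResultTypeQueryable, RowTypeInfo}
import org.apache.flink.types.Row

// passes Rows through unchanged but declares their field types,
// so downstream operators know the schema
class DeclareRowType(fieldTypes: Array[TypeInformation[_]])
    extends MapFunction[Row, Row] with ResultTypeQueryable[Row] {

  override def map(row: Row): Row = row

  override def getProducedType: TypeInformation[Row] = new RowTypeInfo(fieldTypes: _*)
}

// usage: val typedRows: DataSet[Row] = csvData.map(new DeclareRowType(fieldTypes))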

how to convert RDD[(String, Any)] to Array(Row)?

I've got an unstructured RDD with keys and values. The values are of type RDD[Any], the keys are currently Strings (RDD[String]), and they mainly contain Maps. I would like to make them of type Row so I can eventually make a DataFrame. Here is my RDD:
removed
Most of the RDD follows a pattern except for the last 4 keys. How should this be dealt with? Perhaps split them into their own RDD, especially for reverseDeltas?
Thanks
Edit
This is what I've tried so far based on the first answer below.
case class MyData(`type`: List[String], libVersion: Double, id: BigInt)
object MyDataBuilder {
  def apply(s: Any): MyData = {
    // read the input data and convert that to the case class
    s match {
      case Array(x: List[String], y: Double, z: BigInt) => MyData(x, y, z)
      case Array(a: BigInt, Array(x: List[String], y: Double, z: BigInt)) => MyData(x, y, z)
      case _ => null
    }
  }
}
val parsedRdd: RDD[MyData] = rdd.map(x => MyDataBuilder(x))
However, it doesn't seem to match any of those cases. How can I match on a Map in Scala? I keep getting nulls back when printing out parsedRdd.
To convert the RDD to a DataFrame you need to have a fixed schema. If you define the schema for the RDD, the rest is simple.
Something like:
val rdd2:RDD[Array[String]] = rdd.map( x => getParsedRow(x))
val rddFinal:RDD[Row] = rdd2.map(x => Row.fromSeq(x))
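That RDD[Row] still needs a schema attached before it becomes a DataFrame. A small sketch of that last step; the two String columns here are made up, so replace them with the fixed schema of your data:

import org.apache.spark.sql.types.{StringType, StructField, StructType}

// a made-up two-column schema; the Row fields produced above must line up with it
val schema = StructType(Seq(
  StructField("key", StringType, nullable = true),
  StructField("value", StringType, nullable = true)
))

val rowDF = spark.createDataFrame(rddFinal, schema)
rowDF.show()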
Alternate
case class MyData(....) // all the fields of the Schema I want
object MyDataBuilder {
  def apply(s: Any): MyData = {
    // read the input data and convert that to the case class
  }
}
val rddFinal:RDD[MyData] = rdd.map(x => MyDataBuilder(x))
import spark.implicits._
val myDF = rddFinal.toDF
There is a method for converting an RDD to a DataFrame; use it like below:
import spark.implicits._ // needed for the toDF() conversion below
val rdd = sc.textFile("/pathtologfile/logfile.txt")
val df = rdd.toDF()
Now you have a DataFrame; do whatever you want with it using SQL-style queries, like below:
import spark.implicits._
import org.apache.spark.sql.functions.col
val textFile = sc.textFile("hdfs://...")
// Creates a DataFrame having a single column named "line"
val df = textFile.toDF("line")
val errors = df.filter(col("line").like("%ERROR%"))
// Counts all the errors
errors.count()
// Counts errors mentioning MySQL
errors.filter(col("line").like("%MySQL%")).count()
// Fetches the MySQL errors as an array of strings
errors.filter(col("line").like("%MySQL%")).collect()

Create a RDD : too many fields => use case class for RDD

I have a labeled intrusion dataset that I want to use to test different supervised machine learning techniques.
So here is a part of my code:
object parser_dataset {
val conf = new SparkConf()
.setMaster("local[2]")
.setAppName("kdd")
.set("spark.executor.memory", "8g")
conf.registerKryoClasses(Array(
classOf[Array[Any]],
classOf[Array[scala.Tuple3[Int, Int, Int]]],
classOf[String],
classOf[Any]
))
val context = new SparkContext(conf)
def load(file: String): RDD[(Int, String, String,String,Int,Int,Int,Int,Int,Int,Int,Int,Int,Int,Int,Int,Int,Int,Int,Int,Int,Int,Int,Int,Double,Double,Double,Double,Double,Double,Double, Int, Int,Double, Double, Double, Double, Double, Double, Double, Double, String)] = {
val data = context.textFile(file)
val res = data.map(x => {
val s = x.split(",")
(s(0).toInt, s(1), s(2), s(3), s(4).toInt, s(5).toInt, s(6).toInt, s(7).toInt, s(8).toInt, s(9).toInt, s(10).toInt, s(11).toInt, s(12).toInt, s(13).toInt, s(14).toInt, s(15).toInt, s(16).toInt, s(17).toInt, s(18).toInt, s(19).toInt, s(20).toInt, s(21).toInt, s(22).toInt, s(23).toInt, s(24).toDouble, s(25).toDouble, s(26).toDouble, s(27).toDouble, s(28).toDouble, s(29).toDouble, s(30).toDouble, s(31).toInt, s(32).toInt, s(33).toDouble, s(34).toDouble, s(35).toDouble, s(36).toDouble, s(37).toDouble, s(38).toDouble, s(39).toDouble, s(40).toDouble, s(41))
})
.persist(StorageLevel.MEMORY_AND_DISK)
return res
}
def main(args: Array[String]) {
val data = this.load("/home/hvfd8529/Datasets/KDDCup99/kddcup.data_10_percent_corrected")
data.collect.foreach(println)
data.distinct()
}
}
This is not my code; it was given to me and I just modified some parts (especially the RDD and splitting parts). I'm a newbie at Scala and Spark :)
EDIT:
So I added case classes above my load function, like this:
case class BasicFeatures(duration:Int, protocol_type:String, service:String, flag:String, src_bytes:Int, dst_bytes:Int, land:Int, wrong_fragment:Int, urgent:Int)
case class ContentFeatures(hot:Int, num_failed_logins:Int, logged_in:Int, num_compromised:Int, root_shell:Int, su_attempted:Int, num_root:Int, num_file_creations:Int, num_shells:Int, num_access_files:Int, num_outbound_cmds:Int, is_host_login:Int, is_guest_login:Int)
case class TrafficFeatures(count:Int, srv_count:Int, serror_rate:Double, srv_error_rate:Double, rerror_rate:Double, srv_rerror_rate:Double, same_srv_rate:Double, diff_srv_rate:Double, srv_diff_host_rate:Double, dst_host_count:Int, dst_host_srv_count:Int, dst_host_same_srv_rate:Double, dst_host_diff_srv_rate:Double, dst_host_same_src_port_rate:Double, dst_host_srv_diff_host_rate:Double, dst_host_serror_rate:Double, dst_host_srv_serror_rate:Double, dst_host_rerror_rate:Double, dst_host_srv_rerror_rate:Double, attack_type:String )
But now I am confused: how can I use these to solve my problem, given that I still need an RDD in order to have one feature = one field?
Here is one line of the file I want to parse:
0,tcp,ftp_data,SF,491,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,150,25,0.17,0.03,0.17,0.00,0.00,0.00,0.05,0.00,normal,20
The max tuple size supported by Scala is 22; Scala functions have a limit of 22 parameters. Hence you cannot create a tuple with more than 22 elements.
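To connect that to the case classes already defined in the question, one possible sketch is to nest them in a single wrapper record and build an RDD of it. The Record wrapper and parseLine below are made up, and the field indices follow the column order of the sample line above:

case class Record(basic: BasicFeatures, content: ContentFeatures, traffic: TrafficFeatures)

def parseLine(line: String): Record = {
  val s = line.split(",")
  val basic = BasicFeatures(s(0).toInt, s(1), s(2), s(3), s(4).toInt, s(5).toInt,
    s(6).toInt, s(7).toInt, s(8).toInt)
  val content = ContentFeatures(s(9).toInt, s(10).toInt, s(11).toInt, s(12).toInt,
    s(13).toInt, s(14).toInt, s(15).toInt, s(16).toInt, s(17).toInt, s(18).toInt,
    s(19).toInt, s(20).toInt, s(21).toInt)
  val traffic = TrafficFeatures(s(22).toInt, s(23).toInt, s(24).toDouble, s(25).toDouble,
    s(26).toDouble, s(27).toDouble, s(28).toDouble, s(29).toDouble, s(30).toDouble,
    s(31).toInt, s(32).toInt, s(33).toDouble, s(34).toDouble, s(35).toDouble,
    s(36).toDouble, s(37).toDouble, s(38).toDouble, s(39).toDouble, s(40).toDouble, s(41))
  Record(basic, content, traffic)
}

// usage: val rdd: RDD[Record] = context.textFile(file).map(parseLine)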