I am trying to load a CSV data set with Canova/DataVec, and cannot find the "idiomatic" way of doing it. I struggle a bit because the framework has evolved, which makes it difficult for me to determine what is current and what is not.
object S extends App {
  val recordReader: RecordReader = new CSVRecordReader(0, ",")
  recordReader.initialize(new FileSplit(new File("./src/main/resources/CSVdataSet.csv")))
  val iter: DataSetIterator = new RecordReaderDataSetIterator(recordReader, 100)
  while (iter.hasNext) {
    println(iter.next())
  }
}
My CSV file starts with a header line, so my output is an exception:
(java.lang.NumberFormatException: For input string: "iid":)
Since the exception is caused by the header, I started looking into the schema builder and was thinking of adding a schema like this:
val schema = new Schema.Builder()
.addColumnInteger("iid")
.build()
From my point of view (a noob's view), the BasicDataVec examples are not completely clear because they tie everything to Spark etc. Looking at the IrisAnalysisExample (https://github.com/deeplearning4j/dl4j-examples/blob/master/datavec-examples/src/main/java/org/datavec/transform/analysis/IrisAnalysis.java),
I assume that the file content is first read into a JavaRDD (potentially a Stream) and then processed afterwards; the schema is not used except for the DataAnalysis.
So, could someone help me understand how to parse (as a stream or iterator) a CSV file whose first line is a header?
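(For what it's worth, the immediate NumberFormatException can be side-stepped without a schema at all, since CSVRecordReader's first constructor argument is the number of lines to skip. A minimal sketch, assuming the same file as above:)
import java.io.File
import org.datavec.api.records.reader.RecordReader
import org.datavec.api.records.reader.impl.csv.CSVRecordReader
import org.datavec.api.split.FileSplit
import org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator

// 1 = number of header lines to skip, so "iid" is never parsed as a number
val recordReader: RecordReader = new CSVRecordReader(1, ",")
recordReader.initialize(new FileSplit(new File("./src/main/resources/CSVdataSet.csv")))
val iter: DataSetIterator = new RecordReaderDataSetIterator(recordReader, 100)
while (iter.hasNext) println(iter.next())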
I understand from their book (Deep Learning: A Practitioner's Approach) that Spark is needed for data transformation (which is what a schema is used for). I thus rewrote my code to:
object S extends App {
  val schema: Schema = new Schema.Builder()
    .addColumnInteger("iid")
    .build
  val recordReader = new CSVRecordReader(0, ",")
  val f = new File("./src/main/resources/CSVdataSet.csv")
  recordReader.initialize(new FileSplit(f))
  val sparkConf: SparkConf = new SparkConf()
  sparkConf.setMaster("local[*]")
  sparkConf.setAppName("DataVec Example")
  val sc: JavaSparkContext = new JavaSparkContext(sparkConf)
  val lines = sc.textFile(f.getAbsolutePath)
  val examples = lines.map(new StringToWritablesFunction(new CSVRecordReader()))
  val process = new TransformProcess.Builder(schema).build()
  val executor = new SparkTransformExecutor()
  val processed = executor.execute(examples, process)
  println(processed.first())
}
I thought the schema would dictate that I would only have the iid column, but the output is:
[iid, id, gender, idg, .....]
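(The schema on its own, it turns out, only describes the input; an explicit transform step is needed to drop columns. A minimal sketch, assuming DataVec's removeAllColumnsExceptFor builder method behaves as its name suggests:)
val process = new TransformProcess.Builder(schema)
  .removeAllColumnsExceptFor("iid") // drop everything except the iid column
  .build()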
It might be considered bad practice to answer my own question, but I will keep my question (and now answer) for a while to see if it was informative and useful for others.
I now understand how to use a schema on data where I can create a corresponding schema attribute for each feature. I originally wanted to work on a dataset with more than 200 feature values in each vector, and having to declare a static schema containing a column attribute for all 200 features made it impractical to use. However, there is probably a more dynamic way of creating schemas, and I just have not found it yet. I decided to test my code on the Iris.csv data set. There the file contains row attributes for:
Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
Which would be implemented as a schema:
val schema: Schema = new Schema.Builder()
  .addColumnInteger("Id")
  .addColumnDouble("SepalLengthCm")
  .addColumnDouble("SepalWidthCm")
  .addColumnDouble("PetalLengthCm")
  .addColumnDouble("PetalWidthCm")
  .addColumnString("Species")
  .build
I feel that one of the motives behind using a schema is to be able to transform the data. Thus, I would like to perform a transform operation. A TransformProcess defines a sequence of operations to perform on our data (Using DataVec, appendix F, page 405, Deep Learning: A Practitioner's Approach).
A TransformProcess is constructed by specifying two things:
• The Schema of the initial input data
• The set of operations we wish to execute
I decided to see if I could remove a column from the read data:
val process = new TransformProcess.Builder(schema)
  .removeColumns("Id")
  .build()
Thus, my code became:
import org.datavec.api.records.reader.impl.csv.CSVRecordReader
import org.datavec.api.transform.{DataAction, TransformProcess}
import org.datavec.api.transform.schema.Schema
import java.io.File
import org.apache.spark.api.java.JavaSparkContext
import org.datavec.spark.transform.misc.StringToWritablesFunction
import org.apache.spark.SparkConf
import org.datavec.api.split.FileSplit
import org.datavec.spark.transform.SparkTransformExecutor

object S extends App {
  val schema: Schema = new Schema.Builder()
    .addColumnInteger("Id")
    .addColumnDouble("SepalLengthCm")
    .addColumnDouble("SepalWidthCm")
    .addColumnDouble("PetalLengthCm")
    .addColumnDouble("PetalWidthCm")
    .addColumnString("Species")
    .build

  val recordReader = new CSVRecordReader(0, ",")
  val f = new File("./src/main/resources/Iris.csv")
  recordReader.initialize(new FileSplit(f))
  println(recordReader.next())

  val sparkConf: SparkConf = new SparkConf()
  sparkConf.setMaster("local[*]")
  sparkConf.setAppName("DataVec Example")
  val sc: JavaSparkContext = new JavaSparkContext(sparkConf)

  val lines = sc.textFile(f.getAbsolutePath)
  val examples = lines.map(new StringToWritablesFunction(new CSVRecordReader()))
  val process = new TransformProcess.Builder(schema)
    .removeColumns("Id")
    .build()
  val executor = new SparkTransformExecutor()
  val processed = executor.execute(examples, process)
  println(processed.first())
}
The first prints:
[Id, SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm, Species]
The second prints:
[SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm, Species]
Edit: I see that I get a crash with
"org.deeplearning4j" % "deeplearning4j-core" % "0.6.0" as my libraryDependency
while with an old dependency it works
"org.deeplearning4j" % "deeplearning4j-core" % "0.0.3.2.7"
libraryDependencies ++= Seq(
"org.datavec" % "datavec-spark_2.11" % "0.5.0",
"org.datavec" % "datavec-api" % "0.5.0",
"org.deeplearning4j" % "deeplearning4j-core" % "0.0.3.2.7"
//"org.deeplearning4j" % "deeplearning4j-core" % "0.6.0"
)
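My guess is that the crash comes from mixing DataVec 0.5.0 artifacts with deeplearning4j-core 0.6.0; keeping DataVec and DL4J on the same release is presumably the safer combination. A sketch of what I would try (versions assumed to exist together, not verified):
libraryDependencies ++= Seq(
  "org.datavec" % "datavec-spark_2.11" % "0.6.0",
  "org.datavec" % "datavec-api" % "0.6.0",
  "org.deeplearning4j" % "deeplearning4j-core" % "0.6.0"
)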
I am trying to load a complex JSON file (multiple different data types, nested objects/arrays, etc.) from my local machine, read it in as a source using the Table API FileSystem connector, convert it into a DataStream, and then do some processing afterwards (not shown here for brevity).
The conversion gives me a DataStream of type DataStream[Row], which I need to convert to DataStream[RowData] (for sink purposes; I won't go into details here). Thankfully, there's a RowRowConverter utility that helps to do this mapping. It worked when I tried a completely flat JSON, but when I introduced arrays and maps within the JSON, it no longer works.
Here is the exception that was thrown - a null pointer exception:
at org.apache.flink.table.data.conversion.ArrayObjectArrayConverter.allocateWriter(ArrayObjectArrayConverter.java:140)
at org.apache.flink.table.data.conversion.ArrayObjectArrayConverter.toBinaryArrayData(ArrayObjectArrayConverter.java:114)
at org.apache.flink.table.data.conversion.ArrayObjectArrayConverter.toInternal(ArrayObjectArrayConverter.java:93)
at org.apache.flink.table.data.conversion.ArrayObjectArrayConverter.toInternal(ArrayObjectArrayConverter.java:40)
at org.apache.flink.table.data.conversion.DataStructureConverter.toInternalOrNull(DataStructureConverter.java:61)
at org.apache.flink.table.data.conversion.RowRowConverter.toInternal(RowRowConverter.java:75)
at flink.ReadJsonNestedData$.$anonfun$main$2(ReadJsonNestedData.scala:48)
Interestingly, when I set up my breakpoints and debugger, this is what I discovered: the first time RowRowConverter::toInternal is called, it works and goes all the way down to ArrayObjectArrayConverter::allocateWriter().
However, for some strange reason RowRowConverter::toInternal runs twice, and if I continue stepping through, it eventually comes back here, which is where the null pointer exception happens.
Example of the JSON (simplified to a single nested array for brevity). I placed it in my /src/main/resources folder:
{"discount":[670237.997082,634079.372133,303534.821218]}
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment
import org.apache.flink.table.api.DataTypes
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment
import org.apache.flink.table.data.conversion.RowRowConverter
import org.apache.flink.table.types.FieldsDataType
import org.apache.flink.table.types.logical.RowType
import scala.collection.JavaConverters._

object ReadJsonNestedData {
  def main(args: Array[String]): Unit = {
    // setup
    val jsonResource = getClass.getResource("/NESTED.json")
    val jsonFilePath = jsonResource.getPath
    val tableName = "orders"
    val readJSONTable =
      s"""
         | CREATE TABLE $tableName (
         |   `discount` ARRAY<DECIMAL(12, 6)>
         | ) WITH (
         |   'connector' = 'filesystem',
         |   'path' = '$jsonFilePath',
         |   'format' = 'json'
         | )""".stripMargin

    val colFields = Array(
      "discount"
    )
    val defaultDataTypes = Array(
      DataTypes.ARRAY(DataTypes.DECIMAL(12, 6))
    )
    val rowType = RowType.of(defaultDataTypes.map(_.getLogicalType), colFields)
    val defaultDataTypesAsList = defaultDataTypes.toList.asJava
    val dataType = new FieldsDataType(rowType, defaultDataTypesAsList)
    val rowConverter = RowRowConverter.create(dataType)

    // Job
    val env = StreamExecutionEnvironment.getExecutionEnvironment()
    val tableEnv = StreamTableEnvironment.create(env)
    tableEnv.executeSql(readJSONTable)
    val ordersTable = tableEnv.from(tableName)
    val dataStream = tableEnv
      .toDataStream(ordersTable)
      .map(row => rowConverter.toInternal(row))
    dataStream.print()
    env.execute()
  }
}
I would hence like to know:
Why RowRowConverter is not working and how I can remedy it
Why RowRowConverter::toInternal runs twice for the same Row, which may be the cause of that NullPointerException
If my method of instantiating and using the RowRowConverter is correct based on my code above.
Thank you!
Environment:
IntelliJ 2021.3.2 (Ultimate)
AdoptOpenJDK 1.8
Scala: 2.12.15
Flink: 1.13.5
Flink Libraries Used (for this example):
flink-table-api-java-bridge
flink-table-planner-blink
flink-clients
flink-json
The first call of RowRowConverter::toInternal is an internal implementation detail for making a deep copy of the StreamRecord emitted by the table source, and it is independent of the converter in your map function. The reason for the NPE is that the RowRowConverter in the map function is not initialized by calling RowRowConverter::open. You can use a RichMapFunction instead and invoke RowRowConverter::open in RichMapFunction::open.
Thank you to #renqs for the answer.
Here is the code, if anyone is interested.
class ConvertRowToRowDataMapFunction(fieldsDataType: FieldsDataType)
  extends RichMapFunction[Row, RowData] {

  private final val rowRowConverter = RowRowConverter.create(fieldsDataType)

  override def open(parameters: Configuration): Unit = {
    super.open(parameters)
    rowRowConverter.open(this.getClass.getClassLoader)
  }

  override def map(row: Row): RowData =
    this.rowRowConverter.toInternal(row)
}

// at main function
// ... continue from previous
val dataStream = tableEnv
  .toDataStream(personsTable)
  .map(new ConvertRowToRowDataMapFunction(dataType))
I've been facing an issue for the past couple of hours.
In theory, when we split data for training and testing, we should standardize the training data independently, so as not to introduce bias, and then, after having trained the model, standardize the test set using the same "parameter" values as for the training set.
So far I've only managed to do it without the pipeline, looking like this:
val training = splitData(0)
val test = splitData(1)
val assemblerTraining = new VectorAssembler()
.setInputCols(training.columns)
.setOutputCol("features")
val standardScaler = new StandardScaler()
.setInputCol("features")
.setOutputCol("normFeatures")
.setWithStd(true)
.setWithMean(true)
val scalerModel = standardScaler.fit(training)
val scaledTrainingData = scalerModel.transform(training)
val scaledTestData = scalerModel.transform(test)
How would I go about implementing this with pipelines?
My issue is that if I create a pipeline like so:
val pipelineTraining = new Pipeline()
.setStages(
Array(
assemblerTraining,
standardScaler,
lr
)
)
where lr is a LinearRegression, then there is no way to actually access the scaling model from inside the pipeline.
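(That said, a fitted PipelineModel does expose its stages, so after fitting it should be possible to pull the scaler back out. A rough sketch, assuming Spark ML's StandardScalerModel and that the pipeline has been fit on the training split; not something I have verified against my exact setup:)
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.feature.StandardScalerModel

// hypothetical: fit the full pipeline, then look up the fitted scaler stage
val fittedPipeline: PipelineModel = pipelineTraining.fit(training)
val fittedScaler = fittedPipeline.stages
  .collectFirst { case m: StandardScalerModel => m }
  .getOrElse(sys.error("no StandardScalerModel stage found"))

println(fittedScaler.mean) // per-feature means learned on the training split
println(fittedScaler.std)  // per-feature standard deviations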
I've also thought of using an intermediary pipeline to do the scaling like so:
val pipelineScalingModel = new Pipeline()
.setStages(Array(assemblerTraining, standardScaler))
.fit(training)
val pipelineTraining = new Pipeline()
.setStages(Array(pipelineScalingModel,lr))
val scaledTestData = pipelineScalingModel.transform(test)
But I don't know if this is the right way of going about it.
Any suggestions would be greatly appreciated.
In case anybody else runs into this issue, this is how I proceeded:
I realized I was not allowed to modify the [forbiddenColumnName] variable. Therefore I gave up on trying to use pipelines in that phase.
I created my own standardizing function and called it for each individual feature, like so:
def standardizeColumn(dfTrain: DataFrame, dfTest: DataFrame, columnName: String): Array[DataFrame] = {
  // collect the column's mean and standard deviation from the training set only
  val withMeanStd = dfTrain.select(mean(col(columnName)), stddev(col(columnName))).collect
  val auxDFTrain = dfTrain.withColumn(columnName, (col(columnName) - withMeanStd(0).getDouble(0)) / withMeanStd(0).getDouble(1))
  val auxDFTest = dfTest.withColumn(columnName, (col(columnName) - withMeanStd(0).getDouble(0)) / withMeanStd(0).getDouble(1))
  Array(auxDFTrain, auxDFTest)
}
for (columnName <- training.columns) {
  if ((columnName != [forbiddenColumnName]) && (columnExists(training, columnName))) {
    val auxResult = standardizeColumn(training, test, columnName)
    training = auxResult(0)
    test = auxResult(1)
  }
}
[MENTION] My number of variables is very low (~15), so this is not a very lengthy process. I seriously doubt this would be the right way of going about things on much bigger datasets.
I have two files
--------Student.csv---------
StudentId,City
101,NDLS
102,Mumbai
-------StudentDetails.csv---
StudentId,StudentName,Course
101,ABC,C001
102,XYZ,C002
Requirement
StudentId in the first file should be replaced with the StudentName and Course from the second file.
Once replaced, I need to generate a new CSV with the complete details, like:
ABC,C001,NDLS
XYZ,C002,Mumbai
Code used
val studentRDD = sc.textFile(file path);
val studentdetailsRDD = sc.textFile(file path);
val studentB = sc.broadcast(studentdetailsRDD.collect)
//Generating CSV
studentRDD.map{student =>
val name = getName(student.StudentId)
val course = getCourse(student.StudentId)
Array(name, course, student.City)
}.mapPartitions{data =>
val stringWriter = new StringWriter();
val csvWriter =new CSVWriter(stringWriter);
csvWriter.writeAll(data.toList)
Iterator(stringWriter.toString())
}.saveAsTextFile(outputPath)
//Functions defined to get details
def getName(studentId : String) {
studentB.value.map{stud =>if(studentId == stud.StudentId) stud.StudentName}
}
def getCourse(studentId : String) {
studentB.value.map{stud =>if(studentId == stud.StudentId) stud.Course}
}
Problem
The file gets generated, but the values are object representations instead of String values.
How can I get the string values instead of objects ?
As suggested in another answer, Spark's DataFrame API is especially suitable for this, as it easily supports joining two DataFrames, and writing CSV files.
However, if you insist on staying with the RDD API, it looks like the main issue with your code is the lookup functions: getName and getCourse basically do nothing, because their return type is Unit; using an if without an else means that for some inputs there's no return value, which makes the entire function return Unit.
To fix this, it's easier to get rid of them and simplify the lookup by broadcasting a Map:
// better to broadcast a Map instead of an Array, would make lookups more efficient
val studentB = sc.broadcast(studentdetailsRDD.keyBy(_.StudentId).collectAsMap())
// convert to RDD[String] with the wanted formatting
val resultStrings = studentRDD.map { student =>
val details = studentB.value(student.StudentId)
Array(details.StudentName, details.Course, student.City)
}
.map(_.mkString(",")) // naive CSV writing with no escaping etc., you can also use CSVWriter like you did
// save as text file
resultStrings.saveAsTextFile(outputPath)
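Note that the snippet above assumes studentRDD and studentdetailsRDD already hold parsed records with named fields rather than the raw lines sc.textFile returns. A hypothetical parsing step (the case class names are mine) might look like:
// hypothetical case classes and parsing, since sc.textFile yields raw lines
case class Student(StudentId: String, City: String)
case class StudentDetails(StudentId: String, StudentName: String, Course: String)

val studentRDD = sc.textFile("Student.csv")
  .filter(!_.startsWith("StudentId"))            // drop the header line
  .map(_.split(",")).map(a => Student(a(0), a(1)))

val studentdetailsRDD = sc.textFile("StudentDetails.csv")
  .filter(!_.startsWith("StudentId"))
  .map(_.split(",")).map(a => StudentDetails(a(0), a(1), a(2)))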
Spark has great support for joining and writing to file; the join takes one line of code and the write takes another.
Hand-writing that code is error-prone, hard to read, and most likely much slower.
val df1 = Seq((101,"NDLS"),
(102,"Mumbai")
).toDF("id", "city")
val df2 = Seq((101,"ABC","C001"),
(102,"XYZ","C002")
).toDF("id", "name", "course")
val dfResult = df1.join(df2, "id").select("id", "city", "name")
dfResult.repartition(1).write.csv("hello.csv")
A directory will be created; it contains a single file, which is the final result.
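If a header row and a single output file are wanted, something like the following should work (a sketch; coalesce(1) pulls everything into one partition, so only do this for small results):
dfResult
  .coalesce(1)
  .write
  .option("header", "true") // include the column names as the first line
  .csv("hello_with_header")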
I have a directory structure on S3 looking like this:
foo
|-base
|-2017
|-01
|-04
|-part1.orc, part2.orc ....
|-A
|-2017
|-01
|-04
|-part1.orc, part2.orc ....
|-B
|-2017
|-01
|-04
|-part1.orc, part2.orc ....
Meaning that for directory foo I have multiple output tables (base, A, B, etc.) in a given path based on the timestamp of a job.
I'd like to left-join them all, based on a timestamp and the master directory, in this case foo. This would mean reading each output table (base, A, B, etc.) into separate input DataFrames on which a left join can be applied, all with the base table as the starting point.
Something like this (not working code!)
val dfs: Seq[DataFrame] = spark.read.orc("foo/*/2017/01/04/*")
val base: DataFrame = spark.read.orc("foo/base/2017/01/04/*")
val result = dfs.foldLeft(base)((l, r) => l.join(r, 'id, "left"))
Can someone point me in the right direction on how to get that sequence of DataFrames? It might even be worth making the reads lazy or sequential, only reading the A or B table when the join is applied, to reduce memory requirements.
Note: the directory structure is not final, meaning it can change if that fits the solution.
From what I understand, Spark uses the underlying Hadoop API to read in data files, so the inherited behavior is to read everything you specify into one single RDD/DataFrame.
To achieve what you want, you can first get a list of directories with:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{ FileSystem, Path }
val path = "foo/"
val hadoopConf = new Configuration()
val fs = FileSystem.get(hadoopConf)
val paths: Array[String] = fs.listStatus(new Path(path))
  .filter(_.isDirectory)
  .map(_.getPath.toString)
Then load them into separate DataFrames:
val dfs: Array[DataFrame] = paths
  .map(path => spark.read.orc(path + "/2017/01/04/*"))
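From there, the foldLeft from the question should work. A sketch, assuming every table carries an id column and that base is read separately:
// read base separately, then left-join every other table onto it by "id"
val base = spark.read.orc("foo/base/2017/01/04/*")
val others = paths
  .filterNot(_.endsWith("/base"))
  .map(p => spark.read.orc(p + "/2017/01/04/*"))
val joined = others.foldLeft(base)((acc, df) => acc.join(df, Seq("id"), "left"))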
Here's a straightforward solution to what (I think) you're trying to do, without using extra features like Hive or built-in partitioning abilities:
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path, PathFilter}
import spark.implicits._

// load base
val baseDF = spark.read.orc("foo/base/2017/01/04").as("base")

// create or use existing Hadoop FileSystem - this should use the actual config and path
val fs = FileSystem.get(new URI("."), new Configuration())

// find all other subfolders under foo/
val otherFolderPaths = fs.listStatus(new Path("foo/"), new PathFilter {
  override def accept(path: Path): Boolean = path.getName != "base"
}).map(_.getPath)

// use foldLeft to join all, using the DF aliases to find the right "id" column
val result = otherFolderPaths.foldLeft(baseDF) { (df, path) =>
  df.join(spark.read.orc(s"$path/2017/01/04").as(path.getName), $"base.id" === $"${path.getName}.id", "left")
}
I have a pretty large result set (60k+ records, many columns) that I am pulling from a database and parsing with Anorm (though I can use Play's default data access module that returns a ResultSet if needed). I need to transform and stream these results directly to the client (without holding them in a big list in memory), where they will then be downloaded directly to a file on the client's machine.
I have been referring to what is demonstrated in the Chunked responses section of the Play 2.5.x ScalaStream documentation. I am having trouble implementing the "getDataStream" portion of what it shows there.
I've also been referencing what is demoed in the Streaming results and Iteratees sections of the Play 2.5.x ScalaAnorm documentation. I have tried piping the results as an enumerator, like what is returned here:
val resultsEnumerator = Iteratees.from(SQL"SELECT * FROM Test", SqlParser.str("colName"))
into
val dataContent = Source.fromPublisher(Streams.enumeratorToPublisher(resultsEnumerator))
Ok.chunked(dataContent).withHeaders(("ContentType","application/x-download"),("Content-disposition","attachment; filename=myDataFile.csv"))
But the resulting file/content is empty.
And I cannot find any sample code or references on how to convert a function in the data service that returns something like this:
@annotation.tailrec
def go(c: Option[Cursor], l: List[String]): List[String] = c match {
  case Some(cursor) =>
    if (l.size == 10000000) l // custom limit, partial processing
    else go(cursor.next, l :+ cursor.row[String]("VBU_NUM"))
  case _ => l
}

val sqlString = s"select colName FROM ${tableName} WHERE ${whereClauseStr}"
val results: Either[List[Throwable], List[String]] = SQL(sqlString).withResult(go(_, List.empty[String]))
results
into something I can pass to Ok.chunked().
So basically my question is: how should I feed each record fetched from the database into a stream on which I can do a transformation, and send it to the client as a chunked response that can be downloaded to a file?
I would prefer not to use Slick for this. But I can go with a solution that does not use Anorm and just uses the Play dbApi objects that return the raw java.sql.ResultSet, and work with that.
After referencing the Anorm Akka Support documentation and much trial and error, I was able to achieve my desired solution. I had to add these dependencies
"com.typesafe.play" % "anorm_2.11" % "2.5.2",
"com.typesafe.play" % "anorm-akka_2.11" % "2.5.2",
"com.typesafe.akka" %% "akka-stream" % "2.4.4"
to my build.sbt file for Play 2.5,
and I implemented something like this
//...play imports
import anorm.SqlParser._
import anorm._
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Sink, Source}
...
private implicit val akkaActorSystem = ActorSystem("MyAkkaActorSytem")
private implicit val materializer = ActorMaterializer()
def streamedAnormResultResponse() = Action {
implicit val connection = db.getConnection()
val parser: RowParser[...] = ...
val sqlQuery: SqlQuery = SQL("SELECT * FROM table")
val source: Source[Map[String, Any] = AkkaStream.source(sqlQuery, parser, ColumnAliaser.empty).alsoTo(Sink.onComplete({
case Success(v) =>
connection.close()
case Failure(e) =>
println("Info from the exception: " + e.getMessage)
connection.close()
}))
Ok.chunked(source)
}
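To actually produce the CSV download from the original question, the final Ok.chunked(source) inside the action can be replaced by mapping each row to a ByteString line first. A rough sketch (naive column ordering, no escaping; assumed rather than tested):
import akka.util.ByteString

// each row is a Map of column name -> value; join the values into one naive CSV line
val csvSource = source.map(row => ByteString(row.values.mkString(",") + "\n"))

Ok.chunked(csvSource).withHeaders(
  "Content-Type" -> "application/x-download",
  "Content-Disposition" -> "attachment; filename=myDataFile.csv"
)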