I am trying to load a CSV file into Spark from Scala. I see that it can be done using the two syntaxes below:
sqlContext.read.format("csv").options(option).load(path)
sqlContext.read.options(option).csv(path)
What is the difference between the two, and which gives better performance?
Thanks
There's no difference.
So why do both exist?
The .format(fmt).load(path) method is a flexible, pluggable API that allows adding more formats without having to recompile Spark: you can register aliases for custom Data Source implementations and have Spark use them. "csv" used to be such a custom implementation (outside the packaged Spark binaries), but it is now part of the project.
There are shorthand methods for the "built-in" data sources (like csv, parquet, json, ...) which make the code a bit simpler (and are checked at compile time).
Ultimately, both create a CSV Data Source and use it to load the data.
Bottom line, for any supported format, you should opt for the "shorthand" method, e.g. csv(path).
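For illustration, here is a minimal sketch showing the two equivalent calls side by side; it uses the newer SparkSession entry point, and the path and options are illustrative:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-example").master("local[*]").getOrCreate()
// Illustrative options; any CSV data source option works the same in both forms.
val options = Map("header" -> "true", "inferSchema" -> "true")

// Both calls resolve to the same CSV data source and produce the same DataFrame:
val df1 = spark.read.format("csv").options(options).load("data/people.csv")
val df2 = spark.read.options(options).csv("data/people.csv")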
Related
I'm writing a Scala tool that encodes ~300 JSON Schema files into files of a different format and saves them to disk. I need these schemas again later for instantiating JSON data files; or rather, I don't need all of the schemas, only a few fields of each.
I was thinking that the best solution could be to populate a Map object (while the tool encodes the schemas) containing only the info that I need, and to re-use that Map object (in another run of the tool) as an already compiled and populated map.
I've got two questions:
1. Is this really the most performant solution?
2. How can I save the Map object, created at runtime, on disk as a file that can be later built/executed with the rest of my code?
I've read several posts about serialization and storing objects, but I'm not entirely sure whether they cover what I need. I'm also not sure this is the best solution, and I would like to hear an opinion from people with more experience than me.
What I would like to achieve is an elegant solution that allows me to lookup values from a map generated by another tool.
The whole process of compiling/building/executing sometimes is still confusing to me, so apologies if the question is trivial.
To answer your question: I think using an embedded KV store would be more efficient, considering the number of files and the amount of traversal involved.
Here is a small wiki on how to use RocksJava; you can consider RocksDB as a KV store. https://github.com/facebook/rocksdb/wiki/RocksJava-Basics
You can use the reference below to serialize and deserialize an object in Scala and put it into RocksDB as a key-value pair, as I mentioned in the comment.
Convert Any type in scala to Array[Byte] and back
As for how to use RocksDB, the dependency below in your build will suffice:
"org.rocksdb" % "rocksdbjni" % "5.17.2"
Thanks.
According to this
Spark Catalyst is an implementation-agnostic framework for manipulating trees of relational operators and expressions.
I want to use Spark Catalyst to parse SQL DML and DDL statements and generate custom Scala code from them. However, it is not clear to me from reading the code whether there is a wrapper class around Catalyst that I can use. The ideal wrapper would receive a SQL statement and produce the equivalent Scala code. For my use case it would look like this:
def generate(sql: String): String = {
  // custom code
  // returns custom Scala code which is executable, given s as List[String]
}
// e.g. generate("select substring(s, 1, 3) as from t1")
This is a simple example, but the idea is that I don't want to write another parser: I need to parse many SQL constructs from a legacy system and write custom Scala implementations for them.
As a more general question: given the lack of class-level design documentation, how can someone learn the code base and make contributions?
Spark accepts SQL queries via spark.sql. For example, you can feed the string SELECT * FROM table as an argument, as in spark.sql("SELECT * FROM table"), after having registered your DataFrame as "table". To register your DataFrame as "table" for use in SQL queries, create a temporary view using
DataFrame.createOrReplaceTempView("table")
You can see examples here:
https://spark.apache.org/docs/2.1.0/sql-programming-guide.html#running-sql-queries-programmatically
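A minimal sketch of that flow (the input path and view name are illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sql-example").master("local[*]").getOrCreate()
val df = spark.read.json("data/people.json")  // illustrative input
df.createOrReplaceTempView("table")
val result = spark.sql("SELECT * FROM table")
result.show()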
A DataFrame is ultimately executed as RDD operations, and this translation is optimized through Catalyst: when a programmer writes DataFrame code, it is optimized internally before execution. For more detail, see
Catalyst optimisation in Spark
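To observe Catalyst at work, you can print the query plans; a small sketch, assuming the spark session and "table" view from the example above:

// explain(true) prints the parsed, analyzed, optimized (Catalyst), and physical plans
spark.sql("SELECT * FROM table WHERE 1 = 1").explain(true)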
I want to read a CSV file with Flink in Scala, using the addSource and readCsvFile functions. I have not found any simple examples of this. I have only found https://github.com/dataArtisans/flink-training-exercises/blob/master/src/main/scala/com/dataartisans/flinktraining/exercises/datastream_scala/cep/LongRides.scala, which is too complex for my purpose.
Given the definition StreamExecutionEnvironment.addSource(sourceFunction), can I just use readCsvFile as the sourceFunction?
After reading the file, I want to use CEP (Complex Event Processing).
readCsvFile() is only available as part of Flink's DataSet (batch) API, and cannot be used with the DataStream (streaming) API. Here's a pretty good example of readCsvFile(), though it's probably not relevant to what you're trying to do.
readTextFile() and readFile() are methods on StreamExecutionEnvironment, and do not implement the SourceFunction interface -- they are not meant to be used with addSource(), but rather in place of it. Here's an example of using readTextFile() to load a CSV with the DataStream API (see the sketch below).
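A minimal sketch of that approach, assuming a comma-separated file with three fields (String, Long, Int); the path and field layout are illustrative:

import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
// Read the file line by line as a bounded stream, then parse each line.
val lines: DataStream[String] = env.readTextFile("hdfs:///path/to/your_csv_file")
val records: DataStream[(String, Long, Int)] = lines.map { line =>
  val cols = line.split(",")
  (cols(0), cols(1).toLong, cols(2).toInt)
}
// records can now be fed into CEP patterns
env.execute("read-csv-as-stream")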
Another option is to use the Table API, and a CsvTableSource. Here's an example and some discussion of what it does and doesn't do. If you go this route, you'll need to use StreamTableEnvironment.toAppendStream() to convert your table stream to a DataStream before using CEP.
Keep in mind that all of these approaches will simply read the file once and create a bounded stream from its contents. If you want a source that reads in an unbounded CSV stream, and waits for new rows to be appended, you'll need a different approach. You could use a custom source, or a socketTextStream, or something like Kafka.
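For instance, a hedged sketch of the socket alternative (host and port are illustrative; env is the StreamExecutionEnvironment from the sketch above):

// Reads an unbounded stream of CSV lines from a socket instead of a file
val unbounded: DataStream[String] = env.socketTextStream("localhost", 9999)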
If you have a CSV file with 3 fields (String, Long, Integer), then do the following:
val input=benv.readCsvFile[(String,Long,Integer)]("hdfs:///path/to/your_csv_file")
PS: I am using the Flink shell, which is why I have benv (the batch environment).
We are using appliance-based Mirth Connect, version 3.4.2.
We have a few transformers that are common to all the channels, but they are still duplicated under each channel. Any time we have to modify something, we have to make the change in every channel.
We have transformers for:
some functions with JavaScript and Java code
some mappings
some database operations, like inserts, etc.
Can we put this code somewhere where it is shared across channels, so we don't need to write the transformers under each channel?
Thanks
Sid
A good way to do this is to move common code (functions, database operations, etc.) into code templates.
some functions with JavaScript - the Edit Code Templates screen is where you can define common code that is available to all channels.
some database operations like inserts - I believe (as good practice) these should be specific to each channel. If you have a function specific to a certain channel that is used in many places within it, declare it in the appropriate channel script: deploy, preprocessor, postprocessor, or undeploy.
some mappings - I'm not sure about this one. If you use JavaScript for the mapping, you can share it by storing it in a global variable in the global scripts or in code templates.
some Java code - if the Java code lives in a library whose methods are invoked from scripts, design the library with getter and setter objects so that your Mirth script can traverse the Java objects to any depth.
For example, when building XML there are many libraries you can use (StAX, JDOM, etc.), but building the XML with a DocumentBuilderFactory lets you access the Java objects at depth from the Mirth script.
How can you write to multiple outputs, dependent on the key, using Scalding (or Cascading) in a single MapReduce job? I could of course use .filter for all the possible keys, but that is a horrible hack which will fire up many jobs.
There is TemplatedTsv in Scalding (from version 0.9.0rc16 and up), exactly the same as Cascading's TemplateTap.
Tsv(args("input"), ('COUNTRY, 'GDP))
.read
.write(TemplatedTsv(args("output"), "%s", 'COUNTRY))
// it will create a directory for each country under "output" path in Hadoop mode.
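For context, a minimal complete Job around that snippet might look like this (the class name is illustrative):

import com.twitter.scalding._

class SplitByCountryJob(args: Args) extends Job(args) {
  Tsv(args("input"), ('COUNTRY, 'GDP))
    .read
    .write(TemplatedTsv(args("output"), "%s", 'COUNTRY))
}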
Use MultipleOutputFormat and extrapolate from these other SO questions to write a custom output class using the output format:
Create Scalding Source like TextLine that combines multiple files into single mappers,
Compress Output Scalding / Cascading TsvCompressed
This suggestion on the Cascading user group is to use Cascading's TemplateTap; I'm not sure how to connect it to Scalding, though.