How to load a CSV a few lines at a time - Scala

I'm connecting Spark to Cassandra and I was able to print the lines of my CSV using the conventional COPY method. However, if the CSV is very large, as it usually is in Big Data, how could one load the CSV file a few lines at a time in order to avoid freezing and related issues?
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import com.datastax.spark.connector._

object SparkCassandra {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("SparkCassandra")
      .setMaster("local")
      .set("spark.cassandra.connection.host", "localhost")
    val sc = new SparkContext(conf)
    val my_rdd = sc.cassandraTable("my_keyspace", "my_csv")
    my_rdd.take(20).foreach(println)
    sc.stop()
  }
}
Should one use a time variable or something of that nature?

If you just want to load data into Cassandra, or unload data from Cassandra, using the command line, I would recommend looking at the DataStax Bulk Loader (DSBulk): it's heavily optimized for loading data to/from Cassandra/DSE, and it works with both open-source Cassandra and DSE.
In the simplest case, loading into and unloading from a table looks like this (the default format is CSV):
dsbulk load -k keyspace -t table -url my_file.csv
dsbulk unload -k keyspace -t table -url my_file.csv
For more complex cases you may need to provide more options; see the example below. You can find more information in the accompanying series of blog posts.
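For instance, a load with an explicit delimiter, header handling, and a field-to-column mapping might look roughly like this (the shortcut options shown are assumptions based on the documented DSBulk settings; check them against the DSBulk version you use):

dsbulk load -k keyspace -t table -url my_file.csv \
  -delim '|' -header true \
  -m '0=col1, 1=col2, 2=col3'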
If you want to do this with Spark, then I recommend using the DataFrame API instead of RDDs; in that case you just use the standard read and write functions.
To export data from Cassandra to CSV:
import org.apache.spark.sql.cassandra._
val data = spark.read.cassandraFormat("tbl", "ks").load()
data.write.format("csv").save("my_file.csv")
Or to read from CSV and store the data in Cassandra:
import org.apache.spark.sql.cassandra._
import org.apache.spark.sql.SaveMode
val data = spark.read.format("csv").load("my_file.csv")
data.write.cassandraFormat("tbl", "ks").mode(SaveMode.Append).save()

Related

Write AVRO from spark-shell in Spark 2.4

Spark 2.4.0 on Java 1.8.0_161 (Scala 2.11.12)
Run command: spark-shell --jars=spark-avro_2.11-2.4.0.jar
I'm currently working on a POC using small Avro files. I want to be able to read in a (single) Avro file, make a change, then write it back out.
Reading is fine:
val myAv = spark.read.format("avro").load("myAvFile.avro")
However, I am getting this error when trying to write back out (even before making any changes):
scala> myAv.write.format("avro").save("./output-av-file.avro")
org.apache.spark.sql.AnalysisException:
Datasource does not support writing empty or nested empty schemas.
Please make sure the data schema has at least one or more column(s).
;
at org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$validateSchema(DataSource.scala:733)
at org.apache.spark.sql.execution.datasources.DataSource.planForWriting(DataSource.scala:523)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:281)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228)
... 49 elided
I've tried specifying the schema of the dataframe manually, but to no avail:
.write.option("avroSchema", c_schema.toString).format("avro") ...
The reason is quite obvious: the schema is coming through as empty. See this check in the Spark source code:
if (hasEmptySchema(schema)) {
  throw new AnalysisException(
    s"""
       |Datasource does not support writing empty or nested empty schemas.
       |Please make sure the data schema has at least one or more column(s).
     """.stripMargin)
}
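A quick way to confirm this (a minimal sketch; the file name is the one from the question) is to inspect the schema of the loaded DataFrame before attempting the write:

// Inspect the schema inferred from the Avro file
val myAv = spark.read.format("avro").load("myAvFile.avro")
myAv.printSchema()                                    // prints only "root" when there are no fields
println(s"number of columns = ${myAv.schema.fields.length}")

// The write only makes sense once the DataFrame actually has columns
if (myAv.schema.fields.nonEmpty) {
  myAv.write.format("avro").save("./output-av-file.avro")
}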

Spark Structured Streaming from Files on S3/Disk - add batch filename to records/lines?

I am implementing a spark structured streaming application that processes webserver log files from a folder on disk or perhaps S3.
Spark Structured Streaming fits the use case almost perfectly, with one wrinkle.
The filenames in the folder also contain the machine name, e.g.:
/node1_20181101.json.gz
/node1_20181102.json.gz
/node2_20181101.json.gz
/node3_20181102.json.gz
/node4_20181102.json.gz
...and so on.
A (simplified) version of the source looks something like this (I would turn the code below into a continuous stream with windowing etc.):
val inputDF = spark.read
  .option("codec", classOf[GzipCodec].getName)
  .option("maxFilesPerTrigger", 1.toString)
  .json(config.directory)
  .transform { ds =>
    logger.info(ds.inputFiles.mkString(", "))
    ds
  }
  .foreach(println(_))
I would like to transform the batch and add the node ID from the filename to each record line, but I can't seem to find any kind of onBatch trigger that I could use to enrich the record schema with the node ID from the file name.
I have looked at the following and nothing seems to fit:
[FileStreamSource](https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-FileStreamSource.html#metadataLog)
Unfortunately, getting a handle on the machine name from the file name is key to the analytics I do later, and I have no control over how the logs are populated.
Any clues?
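One avenue that may help (not from the original post; input_file_name is a built-in Spark SQL function, while logSchema and the regex are assumptions for illustration) is to tag each record with the path of the file it came from and extract the node name from that path:

import org.apache.spark.sql.functions.{col, input_file_name, regexp_extract}

// Attach the source file path to every record, then pull the node name out of it.
// The regex assumes filenames of the form /nodeX_YYYYMMDD.json.gz as in the question.
val taggedDF = spark.readStream
  .schema(logSchema)                 // streaming file sources require an explicit schema
  .json(config.directory)
  .withColumn("source_file", input_file_name())
  .withColumn("node", regexp_extract(col("source_file"), "(node\\d+)_\\d+\\.json\\.gz", 1))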

Why does using cache on streaming Datasets fail with "AnalysisException: Queries with streaming sources must be executed with writeStream.start()"?

SparkSession
  .builder
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "C:/tmp/spark")
  .config("spark.sql.streaming.checkpointLocation", "C:/tmp/spark/spark-checkpoint")
  .appName("my-test")
  .getOrCreate
  .readStream
  .schema(schema)
  .json("src/test/data")
  .cache
  .writeStream
  .start
  .awaitTermination
While executing this sample in Spark 2.1.0 I got an error. Without the .cache call it worked as intended, but with .cache I got:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
FileSource[src/test/data]
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:196)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:35)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:33)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:128)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForBatch(UnsupportedOperationChecker.scala:33)
at org.apache.spark.sql.execution.QueryExecution.assertSupported(QueryExecution.scala:58)
at org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:69)
at org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:67)
at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:73)
at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:73)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:79)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:75)
at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:84)
at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:84)
at org.apache.spark.sql.execution.CacheManager$$anonfun$cacheQuery$1.apply(CacheManager.scala:102)
at org.apache.spark.sql.execution.CacheManager.writeLock(CacheManager.scala:65)
at org.apache.spark.sql.execution.CacheManager.cacheQuery(CacheManager.scala:89)
at org.apache.spark.sql.Dataset.persist(Dataset.scala:2479)
at org.apache.spark.sql.Dataset.cache(Dataset.scala:2489)
at org.me.App$.main(App.scala:23)
at org.me.App.main(App.scala)
Any idea?
Your (very interesting) case boils down to the following line (that you can execute in spark-shell):
scala> :type spark
org.apache.spark.sql.SparkSession
scala> spark.readStream.text("files").cache
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
FileSource[files]
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:297)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:36)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:34)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForBatch(UnsupportedOperationChecker.scala:34)
at org.apache.spark.sql.execution.QueryExecution.assertSupported(QueryExecution.scala:63)
at org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:74)
at org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:72)
at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:78)
at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:78)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:84)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:89)
at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:89)
at org.apache.spark.sql.execution.CacheManager$$anonfun$cacheQuery$1.apply(CacheManager.scala:104)
at org.apache.spark.sql.execution.CacheManager.writeLock(CacheManager.scala:68)
at org.apache.spark.sql.execution.CacheManager.cacheQuery(CacheManager.scala:92)
at org.apache.spark.sql.Dataset.persist(Dataset.scala:2603)
at org.apache.spark.sql.Dataset.cache(Dataset.scala:2613)
... 48 elided
The reason for this turns out to be quite simple to explain (no pun on Spark SQL's explain intended).
spark.readStream.text("files") creates a so-called streaming Dataset.
scala> val files = spark.readStream.text("files")
files: org.apache.spark.sql.DataFrame = [value: string]
scala> files.isStreaming
res2: Boolean = true
Streaming Datasets are the foundation of Spark SQL's Structured Streaming.
As you may have read in Structured Streaming's Quick Example:
And then start the streaming computation using start().
Quoting the scaladoc of DataStreamWriter's start:
start(): StreamingQuery Starts the execution of the streaming query, which will continually output results to the given path as new data arrives.
So, you have to use start (or foreach) to start the execution of the streaming query. You knew it already.
But...there are Unsupported Operations in Structured Streaming:
In addition, there are some Dataset methods that will not work on streaming Datasets. They are actions that will immediately run queries and return results, which does not make sense on a streaming Dataset.
If you try any of these operations, you will see an AnalysisException like "operation XYZ is not supported with streaming DataFrames/Datasets".
That looks familiar, doesn't it?
cache is not in the list of the unsupported operations, but that's because it has simply been overlooked (I reported SPARK-20927 to fix it).
cache should have been in the list as it does execute a query before the query gets registered in Spark SQL's CacheManager.
Let's go deeper into the depths of Spark SQL...hold your breath...
cache simply calls persist, and persist requests the current CacheManager to cache the query:
sparkSession.sharedState.cacheManager.cacheQuery(this)
While caching a query CacheManager does execute it:
sparkSession.sessionState.executePlan(planToCache).executedPlan
which we know is not allowed for streaming sources, since it is start (or foreach) that is supposed to do that.
Problem solved!
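For completeness, a minimal sketch of the same query without cache (the console sink is a placeholder; schema and paths are taken from the original snippet), which is the shape Structured Streaming expects; any caching would have to happen downstream of writeStream:

// Works: nothing caches the streaming Dataset, and the query is started with writeStream.start
val query = spark.readStream
  .schema(schema)                    // schema as in the original snippet
  .json("src/test/data")
  .writeStream
  .format("console")                 // a simple sink, just for the sketch
  .start()

query.awaitTermination()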

How to cache Dataframe in Apache ignite

I am writing code to cache RDBMS data using a Spark SQLContext JDBC connection. Once a DataFrame is created, I want to cache that result set using Apache Ignite, thereby making it possible for other applications to make use of it. Here is the code snippet.
object test {
  def main(args: Array[String]) {
    val configuration = new Configuration()
    val config = "src/main/scala/config.xml"
    val sparkConf = new SparkConf().setAppName("test").setMaster("local[*]")
    val sc = new SparkContext(sparkConf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val sql_dump1 = sqlContext.read.format("jdbc")
      .option("url", "jdbc URL")
      .option("driver", "com.mysql.jdbc.Driver")
      .option("dbtable", mysql_table_statement)
      .option("user", "username")
      .option("password", "pass")
      .load()
    val ic = new IgniteContext[Integer, Integer](sc, config)
    val sharedrdd = ic.fromCache("hbase_metadata")
    // How to cache sql_dump1 dataframe
  }
}
Now the question is how to cache a DataFrame. IgniteRDD has a savePairs method, but it accepts the key and value as RDD[Integer]; I have a DataFrame, and even if I convert it to an RDD I would only get an RDD[Row]. The savePairs method taking an RDD of Integer seems too specific: what if I have an RDD of String as the value? Is it good to cache a DataFrame at all, or is there a better approach to caching the result set?
There is no reason to store a DataFrame in an Ignite cache (shared RDD), since you won't benefit from it much: at the very least, you won't be able to execute Ignite SQL over the DataFrame.
I would suggest doing the following:
Provide a CacheStore implementation for the hbase_metadata cache that will preload all the data from your underlying database. You can then preload all the data into the cache using the Ignite.loadCache method. Here you may find an example of how to use JDBC persistent stores along with an Ignite cache (shared RDD).
Use the Ignite shared RDD SQL API to query over the cached data.
Alternatively, you can get sql_dump1 as you're doing, iterate over each row, and store each row individually in the shared RDD using the IgniteRDD.savePairs method (a rough sketch follows below). After this is done you can query over the data using the same Ignite shared RDD SQL mentioned above.
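A rough sketch of that alternative, reusing the IgniteContext API from the question (the key and value types, the column positions, and the cache name are assumptions; adjust them to your schema and Ignite version):

import org.apache.ignite.spark.IgniteContext

// Keyed shared RDD: here we assume the first column is an integer id
// and the second column is a string payload.
val ic = new IgniteContext[Integer, String](sc, config)
val sharedRdd = ic.fromCache("hbase_metadata")

// Convert each DataFrame row to a (key, value) pair and store it in the cache.
val pairs = sql_dump1.rdd.map(row => (Integer.valueOf(row.getInt(0)), row.getString(1)))
sharedRdd.savePairs(pairs)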

Spark - ElasticSearch Index creation performance too slow

I am trying to use Apache Spark to create an index in Elasticsearch (writing huge amounts of data to ES). I have written a Scala program that creates the index using Apache Spark. The data to index arrives as product beans in a LinkedList, so I tried to traverse the product bean list and create the index. My code is given below.
val conf = new SparkConf().setAppName("ESIndex").setMaster("local[*]")
conf.set("es.index.auto.create", "true").set("es.nodes", "127.0.0.1")
.set("es.port", "9200")
.set("es.http.timeout", "5m")
.set("es.scroll.size", "100")
val sc = new SparkContext(conf)
//Return my product bean as a in a linkedList.
val list: util.LinkedList[product] = getData()
for (item <- list) {
sc.makeRDD(Seq(item)).saveToEs("my_core/json")
}
The issue with this approach is that it takes too much time to create the index.
Is there a better way to create the index?
Don't pass data through the driver unless it is necessary. Depending on the source of the data returned from getData, you should use the relevant input method or create your own. If the data comes from MongoDB, use for example mongo-hadoop, Spark-MongoDB, or Drill with a JDBC connection. Then use map or a similar method to build the required objects and call saveToEs on the transformed RDD.
Creating an RDD with a single element doesn't make sense; it doesn't benefit from the Spark architecture at all. You just start a potentially huge number of tasks which do next to nothing, with only a single active executor.
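If the data really does have to start out as an in-memory list on the driver, a rough sketch (assuming the elasticsearch-spark connector is on the classpath and product is serializable) is to parallelize the whole list once and index it with a single saveToEs call instead of one RDD per item:

import scala.collection.JavaConverters._
import org.elasticsearch.spark._   // provides saveToEs on RDDs

// One RDD for the whole list, one distributed indexing job.
val products: java.util.LinkedList[product] = getData()
val productsRdd = sc.parallelize(products.asScala.toSeq)
productsRdd.saveToEs("my_core/json")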