ingesting data in solr using spark scala

ingesting data in solr using spark scala - scala

I am trying to ingest data to solr using scala and spark however, my code is missing something. For instance, I got below code from Hortonworks tutorial.
I am using spark 1.6.2, solr 5.2.1, scala 2.10.5.
Can anybody provide me a workable snippet to successfully insert data into solr?
val input_file = "hdfs:///tmp/your_text_file"
case class Person(id: Int, name: String)
val people_df1 = sc.textFile(input_file).map(_.split(",")).map(p => Person(p(0).trim.toInt, p(1))).toDF()
val docs = people_df1.map{doc=>
val docx=SolrSupport.autoMapToSolrInputDoc(doc.getAs[Int]("id").toString, doc, null)
docx.setField("scala_s", "supercool")
docx.setField("name_s", doc.getAs[String]("name"))
}
// below code has compilation issue somehow although jar file doest contain these functions.
SolrSupport.indexDocs("sandbox.hortonworks.com:2181","testsparksolr",10,docs)
val solrServer = com.lucidworks.spark.SolrSupport.getSolrServer("http://ambari.asiacell.com:2181")
solrServer.setDefaultCollection("
testsparksolr")
solrServer.commit(false, false)
thanks in advance

Have you tried spark-solr?
The library's main focus is to provide a clean API to index documents to a Solr server as in your case.

Related

Spark Scala file stream

I am new to Spark and Scala. I want to keep read files from folder and persist file content in Cassandra. I have written simple Scala program using file streaming to read the file content. it is not reading files from the specified folder.
Can anybody correct my below sample code ?
i am using Windows 7
Code:
val spark = SparkHelper.getOrCreateSparkSession()
val ssc = new StreamingContext(spark.sparkContext, Seconds(1))
val lines = ssc.textFileStream("file:///C:/input/")
lines.foreachRDD(file=> {
file.foreach(fc=> {
println(fc)
})
})
ssc.start()
ssc.awaitTermination()
}

I think a normal spark job is needed for the scenario rather than spark streaming.Spark streaming is used in cases where your source is something like kafka or a normal port where there is continuous inflow of data.

Avro Files always read as GenericRecord

I have a avro files with a specified schema.
When I am loading the Avro Files they always come out as GenericData even though I am specifying the Schema.
val schema = Article.Schema$
val job = new Job()
AvroJob.setInputKeySchema(job, schema)
val rootDir = "path-to-avro-files"
val articlesRDD = sc.newAPIHadoopFile(rootDir, classOf[AvroKeyInputFormat[Article]], classOf[AvroKey[Article]], classOf[NullWritable], job.getConfiguration)
This code works and I get an RDD with the data contained in the avro files but unfortunately the entries of the RDD are all of type GenericData. This means whenever I want to access a field of my specific schema, I am getting the following error:
Caused by: java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to de.uni_mannheim.desq.converters.nyt.avroschema.Article
This is the code I use to extract a field from the avro file
val abstracts = articlesRDD.map(tuple => {
val abstract = tuple._1.datum.getAbstract
abstract
}
Also calling 'asInstanceOf' after accessing the 'datum' (in order to convert the GenericRecord to my Article) leads to the same error.

So I ended up following this tutorial (http://subprotocol.com/system/apache-spark-ec2-avro.html) and regenerating my AvroSchema with a newer version of the Avro-tools.
The version generated with avro-tools 1.7.x is not working with this solution while the version generated with 1.8.1 does.

Converting a sequence of Json Object to An Rdd

Iam currently having a json object say student.json. The Structure looks something like this
{"serialNo":"1","name":"Rahul"}
{"serialNo":"2","name":"Rakshith"}
case class Student(serialNo:Int,name:String)
student.json is a huge file which Iam planning to parse through a spark job. And the snippet :
import play.api.libs.json.{ Json, JsObject, JsString }
.....
.....
for(jsonLine <-sc.textFile("student.json")
student<- Json.parse(jsonLine).asOpt[Student])
yield(student.serialNumber -> student.name)
Is there a better way to do this??

If student.json is a huge file, and each line is just a valid json object, you should do:
val myRdd = sc.textFile("student.json").map(l=> Json.parse(l).asOpt[Student])
If you want to get the RDD to your local master, you can:
val students = myRdd.collect()..// then you can do operate it in the old fashion way.
I saw you are importing play.api.libs.json which is from the Play Framework. I don't think running a Spark program in a web application is a good idea...

Self-consistent and updated example of using Spark over ElasticSearch

This guy had a very small example that showed how to integrate ElasticSearch and Spark, when all the ES ecosystem was around version 0.9. Nowadays, it doesn't work anymore (and googling for it doesn't seem an easy feat). Can someone give a small, self-contained Scala example of:
Opening a file in spark (in the example above, it was /var/log/syslog);
Doing something with it;
Sending the result into ES;
Opening that result back in Spark.
... that works with ElasticSearch 1.3.4 and Spark 1.1.0.

I gave a talk awhile back with Spark and Elastic Search (around the 0.9 days), and I recently updated some of the examples for present day (read 1.1). I've posted the slides and the example code. Hope that helps!
I've also copied the relevant sections (from my own github repo) here:
import org.elasticsearch.spark.sql._
...
val tweetsAsCS =
createSchemaRDD(tweetRDD.map(SharedIndex.prepareTweetsCaseClass))
tweetsAsCS.saveToEs(esResource)
Note that we didn't specify any ES nodes. This will default to trying to save to a cluster on local host. If we want to use a different cluster we can add:
// if we want to have a different es cluster we can add
import org.elasticsearch.hadoop.cfg.ConfigurationOptions
val config = new SparkConf()
config.set(ConfigurationOptions.ES_NODES, node) // set the node for discovery
// other config settings
val sc = new SparkContext(config)
So that will do the first part (indexing some data).
Querying ES from Spark has also gotten a lot simpler, although only if your data types are supported by the mappings of the connector (the primary one I ran into that wasn't was geolocation but its easy enough to extend the mapper if you run into this).
val query = "{\"query\": {\"filtered\" : {\"query\" : {\"match_all\" : {}},\"filter\" : { \"geo_distance\" : { \"distance\" : \""+ dist + "km\", \"location\" : { \"lat\" : "+ lat +", \"lon\" : "+ lon +" }}}}}}"
val tweets = sqlCtx.esRDD(esResource, query)
The esRDD function isn't normally on the SQLContext, but the implicit conversions we imported up above make it available to us. tweets is now a SchemaRDD and we can update it as desired and save the results back as we did in the first part of this example.
Hope this helps!

How can I connect to a postgreSQL database in scala?

I want to know how can I do following things in scala?
Connect to a postgreSQL database.
Write SQL queries like SELECT , UPDATE etc. to modify a table in that database.
I know that in python I can do it using PygreSQL but how to do these things in scala?

You need to add dependency "org.postgresql" % "postgresql" % "9.3-1102-jdbc41" in build.sbt and you can modify following code to connect and query database. Replace DB_USER with your db user and DB_NAME as your db name.
import java.sql.{Connection, DriverManager, ResultSet}
object pgconn extends App {
println("Postgres connector")
classOf[org.postgresql.Driver]
val con_st = "jdbc:postgresql://localhost:5432/DB_NAME?user=DB_USER"
val conn = DriverManager.getConnection(con_str)
try {
val stm = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)
val rs = stm.executeQuery("SELECT * from Users")
while(rs.next) {
println(rs.getString("quote"))
}
} finally {
conn.close()
}
}

I would recommend having a look at Doobie.
This chapter in the "Book of Doobie" gives a good sense of what your code will look like if you make use of this library.
This is the library of choice right now to solve this problem if you are interested in the pure FP side of Scala, i.e. scalaz, scalaz-stream (probably fs2 and cats soon) and referential transparency in general.
It's worth nothing that Doobie is NOT an ORM. At its core, it's simply a nicer, higher-level API over JDBC.

Take look at the tutorial "Using Scala with JDBC to connect to MySQL", replace the db url and add the right jdbc library. The link got broken so here's the content of the blog:
Using Scala with JDBC to connect to MySQL
A howto on connecting Scala to a MySQL database using JDBC. There are a number of database libraries for Scala, but I ran into a problem getting most of them to work. I attempted to use scala.dbc, scala.dbc2, Scala Query and Querulous but either they aren’t supported, have a very limited featured set or abstracts SQL to a weird pseudo language.
The Play Framework has a new database library called ANorm which tries to keep the interface to basic SQL but with a slight improved scala interface. The jury is still out for me, only used on one project minimally so far. Also, I’ve only seen it work within a Play app, does not look like it can be extracted out too easily.
So I ended up going with basic Java JDBC connection and it turns out to be a fairly easy solution.
Here is the code for accessing a database using Scala and JDBC. You need to change the connection string parameters and modify the query for your database. This example was geared towards MySQL, but any Java JDBC driver should work the same with Scala.
Basic Query
import java.sql.{Connection, DriverManager, ResultSet};
// Change to Your Database Config
val conn_str = "jdbc:mysql://localhost:3306/DBNAME?user=DBUSER&password=DBPWD"
// Load the driver
classOf[com.mysql.jdbc.Driver]
// Setup the connection
val conn = DriverManager.getConnection(conn_str)
try {
// Configure to be Read Only
val statement = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)
// Execute Query
val rs = statement.executeQuery("SELECT quote FROM quotes LIMIT 5")
// Iterate Over ResultSet
while (rs.next) {
println(rs.getString("quote"))
}
}
finally {
conn.close
}
You will need to download the mysql-connector jar.
Or if you are using maven, the pom snippets to load the mysql connector, you’ll need to check what the latest version is.
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.12</version>
</dependency>
To run the example, save the following to a file (query_test.scala) and run using, the following specifying the classpath to the connector jar:
scala -cp mysql-connector-java-5.1.12.jar:. query_test.scala
Insert, Update and Delete
To perform an insert, update or delete you need to create an updatable statement object. The execute command is slightly different and you will most likely want to use some sort of parameters. Here’s an example doing an insert using jdbc and scala with parameters.
// create database connection
val dbc = "jdbc:mysql://localhost:3306/DBNAME?user=DBUSER&password=DBPWD"
classOf[com.mysql.jdbc.Driver]
val conn = DriverManager.getConnection(dbc)
val statement = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_UPDATABLE)
// do database insert
try {
val prep = conn.prepareStatement("INSERT INTO quotes (quote, author) VALUES (?, ?) ")
prep.setString(1, "Nothing great was ever achieved without enthusiasm.")
prep.setString(2, "Ralph Waldo Emerson")
prep.executeUpdate
}
finally {
conn.close
}

We are using Squeryl, which is working well so far for us. Depending on your needs it may do the trick.
Here is a list of supported DB's and the adapters

If you want/need to write your own SQL, but hate the JDBC interface, take a look at O/R Broker

I would recommend the Quill query library. Here is an introduction post by Li Haoyi to get started.
TL;DR
{
import io.getquill._
import com.zaxxer.hikari.{HikariConfig, HikariDataSource}
val pgDataSource = new org.postgresql.ds.PGSimpleDataSource()
pgDataSource.setUser("postgres")
pgDataSource.setPassword("example")
val config = new HikariConfig()
config.setDataSource(pgDataSource)
val ctx = new PostgresJdbcContext(LowerCase, new HikariDataSource(config))
import ctx._
}
Define a class ORM:
// mapping `city` table
case class City(
id: Int,
name: String,
countryCode: String,
district: String,
population: Int
)
and query all items:
# ctx.run(query[City])
cmd11.sc:1: SELECT x.id, x.name, x.countrycode, x.district, x.population FROM city x
val res11 = ctx.run(query[City])
^
res11: List[City] = List(
City(1, "Kabul", "AFG", "Kabol", 1780000),
City(2, "Qandahar", "AFG", "Qandahar", 237500),
City(3, "Herat", "AFG", "Herat", 186800),
...

ScalikeJDBC is quite easy to use. It allows you to write raw SQL using interpolated strings.
http://scalikejdbc.org/

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

ingesting data in solr using spark scala - scala

Have you tried spark-solr? The library's main focus is to provide a clean API to index documents to a Solr server as in your case.

Related

Spark Scala file stream

Avro Files always read as GenericRecord

Converting a sequence of Json Object to An Rdd

Self-consistent and updated example of using Spark over ElasticSearch

How can I connect to a postgreSQL database in scala?

Categories

Resources