I'm using Apache Tika for text extraction and I have to handle scanned PDF images, so I'm trying Tesseract, but I'm having trouble finding any good resource on sensible default settings.
I'm also seeing what seem like odd post-processing artifacts:
I get this:
"och ptensionskos nader"
from this image:
It really looks as if some post-processing step has moved the "t" towards the front of the word and left a blank in its place. It seems very strange that it would do this unless some post-processing setting is badly off.
These are my basic settings from Apache Tika:
val pdfConfig: PDFParserConfig = {
  val pdfConf = new PDFParserConfig()
  pdfConf.setOcrDPI(150)
  pdfConf.setDetectAngles(false)
  pdfConf.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.OCR_ONLY)
  pdfConf
}

val tesseractOCRConfig: TesseractOCRConfig = {
  val tessConf = new TesseractOCRConfig()
  tessConf.setLanguage("eng+swe")
  tessConf.setEnableImageProcessing(1)
  tessConf.setResize(100) // 100-900 - lower faster.
  // tessConf.setApplyRotation(true)
  tessConf
}
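For reference, these configs are applied to the parse call roughly like this (a simplified sketch, not my exact code; the input stream handling is omitted):

import org.apache.tika.metadata.Metadata
import org.apache.tika.parser.{AutoDetectParser, ParseContext}
import org.apache.tika.sax.BodyContentHandler

val context = new ParseContext()
context.set(classOf[PDFParserConfig], pdfConfig)
context.set(classOf[TesseractOCRConfig], tesseractOCRConfig)

val parser  = new AutoDetectParser()
val handler = new BodyContentHandler(-1) // -1 = no write limit
parser.parse(pdfInputStream, handler, new Metadata(), context) // pdfInputStream: stream of the scanned PDF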
Any help highly appreciated!
Another important property in the PDF config controls whether internal (inline) images are processed:
pdfConf.setExtractInlineImages(true) // for a scanned PDF, setting this to false makes no sense
In TesseractOCRConfig, setTimeout() is also useful.
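For example (the value here is only an illustration; the unit is seconds):

tessConf.setTimeout(300) // let Tesseract run for up to 300 seconds per document before giving up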
I am going to test Spark's RDD cache by running PythonPageRank on CentOS 7:
spark-submit --master yarn --deploy-mode cluster /usr/spark/examples/src/main/python/pagerank.py input/testpr.txt 10
As you can see, I am running PageRank, so input/testpr.txt (the input file) and 10 (the number of iterations) are the parameters.
The file pagerank.py contains the following code:
spark = SparkSession\
    .builder\
    .appName("PythonPageRank")\
    .getOrCreate()

lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
links = lines.map(lambda urls: parseNeighbors(urls)).distinct().groupByKey().cache()
ranks = links.map(lambda url_neighbors: (url_neighbors[0], 1.0))

for iteration in range(int(sys.argv[2])):
    contribs = links.join(ranks).flatMap(
        lambda url_urls_rank: computeContribs(url_urls_rank[1][0], url_urls_rank[1][1]))
    ranks = contribs.reduceByKey(add).mapValues(lambda rank: rank * 0.85 + 0.15)

for (link, rank) in ranks.collect():
    print("%s has rank: %s." % (link, rank))

spark.stop()
As you can see, links = lines.map(lambda urls: parseNeighbors(urls)).distinct().groupByKey().cache() calls cache(). However, when I look at the Spark UI's Storage page, I can't find anything about cached RDDs.
Here is the PageRank application; it runs fine.
Here is the Jobs page of the application; the collect() action generates a job:
Here is the Stages page of the application; it shows the many iterations that PageRank runs.
Here is the Storage page of the application, which should list the cached RDDs. However, it shows nothing, which makes it look like the cache() doesn't work.
Why can't I see any cached RDDs on the Storage page? Why doesn't the cache() in pagerank.py work? Hope someone can help me.
You can add spark.eventLog.logBlockUpdates.enabled true to spark-defaults.conf, so that the Spark History Server's Storage tab is no longer blank.
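For example, in spark-defaults.conf (shown at its usual default location, adjust to your install), or as a per-job flag:

# $SPARK_HOME/conf/spark-defaults.conf
# Event logging must be enabled for the History Server (if it isn't already)
spark.eventLog.enabled                 true
# Record block updates so cached RDDs show up on the Storage tab
spark.eventLog.logBlockUpdates.enabled true

# or per job:
spark-submit --conf spark.eventLog.logBlockUpdates.enabled=true ...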
I am trying to ingest data into Solr using Scala and Spark; however, my code is missing something. For instance, I got the code below from a Hortonworks tutorial.
I am using Spark 1.6.2, Solr 5.2.1, and Scala 2.10.5.
Can anybody provide me with a working snippet to successfully insert data into Solr?
val input_file = "hdfs:///tmp/your_text_file"
case class Person(id: Int, name: String)
val people_df1 = sc.textFile(input_file).map(_.split(",")).map(p => Person(p(0).trim.toInt, p(1))).toDF()

val docs = people_df1.map { doc =>
  val docx = SolrSupport.autoMapToSolrInputDoc(doc.getAs[Int]("id").toString, doc, null)
  docx.setField("scala_s", "supercool")
  docx.setField("name_s", doc.getAs[String]("name"))
  docx // return the document so `docs` is an RDD of SolrInputDocument
}

// the code below has a compilation issue somehow, although the jar file does contain these functions
SolrSupport.indexDocs("sandbox.hortonworks.com:2181", "testsparksolr", 10, docs)
val solrServer = com.lucidworks.spark.SolrSupport.getSolrServer("http://ambari.asiacell.com:2181")
solrServer.setDefaultCollection("testsparksolr")
solrServer.commit(false, false)
Thanks in advance.
Have you tried spark-solr?
The library's main focus is to provide a clean API for indexing documents to a Solr server, as in your case.
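Roughly, indexing the DataFrame from your snippet with spark-solr would look like the sketch below (option names and write support can differ between spark-solr releases, so treat this as an assumption to verify against the version matching Spark 1.6):

import org.apache.spark.sql.SaveMode

// spark-solr registers a "solr" data source, so the DataFrame can be written directly
val options = Map(
  "zkhost"     -> "sandbox.hortonworks.com:2181", // ZooKeeper connect string
  "collection" -> "testsparksolr"                 // target Solr collection
)

people_df1.write
  .format("solr")
  .options(options)
  .mode(SaveMode.Overwrite)
  .save()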
I was using an old version of Flink. I upgraded to 1.2.0 and I have some issues with filters.
I have a DataStream of Log which works just fine:
val logs: DataStream[Log] = env.addSource(new LogSource(
data, delay, factor))
// DISPLAY TUPLE IN CONSOLE
logs.print()
// EXECUTE SCRIPT
env.execute("stream")
I have of course read the documentation, which shows:
dataStream.filter { _ != 0 }
I tried a bunch of things like this :
val cleanLogs = logs.filter { _.isComplete }
But I got the following error:
Type mismatch, expected: FilterFunction[Log], actual: (Any) => An
So I don't see the link between the documentation and this error.
Any help? Examples?
Thanks
The problem was first a wrong import of StreamExecutionEnvironment, which led to this issue with basic functions like filter.
Then, since I came from an old version of Flink, I was using the LocalExecutionEnvironment class, which is no longer available in Flink 1.x.
Use StreamExecutionEnvironment.createLocalEnvironment(1) instead.
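A minimal sketch of the corrected setup, reusing the Log / LogSource types from the question (the wildcard import brings in the Scala DataStream API and its implicit TypeInformation):

// Scala streaming API, not org.apache.flink.streaming.api.environment.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.createLocalEnvironment(1)

val logs: DataStream[Log] = env.addSource(new LogSource(data, delay, factor))

// With the Scala API in scope, a plain predicate compiles as a filter
val cleanLogs = logs.filter { _.isComplete }

env.execute("stream")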
I want to know how I can do the following things in Scala:
Connect to a PostgreSQL database.
Write SQL queries like SELECT, UPDATE, etc. to read and modify a table in that database.
I know that in Python I can do this using PygreSQL, but how do I do these things in Scala?
You need to add the dependency "org.postgresql" % "postgresql" % "9.3-1102-jdbc41" to build.sbt, and then you can adapt the following code to connect to and query the database. Replace DB_USER with your database user and DB_NAME with your database name.
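In build.sbt that is a single line:

libraryDependencies += "org.postgresql" % "postgresql" % "9.3-1102-jdbc41"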
import java.sql.{Connection, DriverManager, ResultSet}

object pgconn extends App {
  println("Postgres connector")

  // Load the PostgreSQL JDBC driver
  classOf[org.postgresql.Driver]

  val con_str = "jdbc:postgresql://localhost:5432/DB_NAME?user=DB_USER"
  val conn = DriverManager.getConnection(con_str)
  try {
    // Read-only, forward-only statement
    val stm = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)
    // Adjust the table and column names to your schema
    val rs = stm.executeQuery("SELECT * from Users")
    while (rs.next) {
      println(rs.getString("quote"))
    }
  } finally {
    conn.close()
  }
}
I would recommend having a look at Doobie.
This chapter in the "Book of Doobie" gives a good sense of what your code will look like if you make use of this library.
This is the library of choice right now to solve this problem if you are interested in the pure FP side of Scala, i.e. scalaz, scalaz-stream (probably fs2 and cats soon) and referential transparency in general.
It's worth noting that Doobie is NOT an ORM. At its core, it's simply a nicer, higher-level API over JDBC.
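A very small sketch of what a query looks like; the exact imports and transactor setup vary noticeably between Doobie versions (this assumes a recent Doobie built on cats-effect), and the users table and DB_* placeholders are illustrative:

import doobie._
import doobie.implicits._
import cats.effect.IO
import cats.effect.unsafe.implicits.global

// A Transactor describes how to obtain connections (DriverManager is fine for experiments)
val xa = Transactor.fromDriverManager[IO](
  "org.postgresql.Driver",
  "jdbc:postgresql://localhost:5432/DB_NAME",
  "DB_USER",
  "DB_PASSWORD"
)

// Queries are plain values; nothing touches the database until they are transacted
val names: IO[List[String]] =
  sql"SELECT name FROM users".query[String].to[List].transact(xa)

println(names.unsafeRunSync())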
Take a look at the tutorial "Using Scala with JDBC to connect to MySQL"; replace the DB URL and add the right JDBC library. The link broke, so here's the content of the blog:
Using Scala with JDBC to connect to MySQL
A howto on connecting Scala to a MySQL database using JDBC. There are a number of database libraries for Scala, but I ran into problems getting most of them to work. I attempted to use scala.dbc, scala.dbc2, Scala Query and Querulous, but they either aren't supported, have a very limited feature set, or abstract SQL into a weird pseudo-language.
The Play Framework has a new database library called ANorm which tries to keep the interface close to basic SQL but with a slightly improved Scala interface. The jury is still out for me; I've only used it minimally on one project so far. Also, I've only seen it work within a Play app; it does not look like it can be extracted out easily.
So I ended up going with basic Java JDBC connection and it turns out to be a fairly easy solution.
Here is the code for accessing a database using Scala and JDBC. You need to change the connection string parameters and modify the query for your database. This example was geared towards MySQL, but any Java JDBC driver should work the same with Scala.
Basic Query
import java.sql.{Connection, DriverManager, ResultSet}

// Change to Your Database Config
val conn_str = "jdbc:mysql://localhost:3306/DBNAME?user=DBUSER&password=DBPWD"

// Load the driver
classOf[com.mysql.jdbc.Driver]

// Setup the connection
val conn = DriverManager.getConnection(conn_str)
try {
  // Configure to be Read Only
  val statement = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)

  // Execute Query
  val rs = statement.executeQuery("SELECT quote FROM quotes LIMIT 5")

  // Iterate Over ResultSet
  while (rs.next) {
    println(rs.getString("quote"))
  }
} finally {
  conn.close()
}
You will need to download the mysql-connector jar.
Or, if you are using Maven, here is the pom snippet to load the MySQL connector; you'll need to check what the latest version is.
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.12</version>
</dependency>
To run the example, save the code to a file (query_test.scala) and run it using the following command, specifying the classpath to the connector jar:
scala -cp mysql-connector-java-5.1.12.jar:. query_test.scala
Insert, Update and Delete
To perform an insert, update, or delete you need to create an updatable statement object. The execute command is slightly different, and you will most likely want to use some sort of parameters. Here's an example of doing an insert using JDBC and Scala with parameters.
// create database connection
val dbc = "jdbc:mysql://localhost:3306/DBNAME?user=DBUSER&password=DBPWD"
classOf[com.mysql.jdbc.Driver]
val conn = DriverManager.getConnection(dbc)
val statement = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_UPDATABLE)

// do database insert
try {
  val prep = conn.prepareStatement("INSERT INTO quotes (quote, author) VALUES (?, ?)")
  prep.setString(1, "Nothing great was ever achieved without enthusiasm.")
  prep.setString(2, "Ralph Waldo Emerson")
  prep.executeUpdate
} finally {
  conn.close
}
We are using Squeryl, which is working well so far for us. Depending on your needs it may do the trick.
Here is a list of supported DBs and their adapters.
If you want/need to write your own SQL, but hate the JDBC interface, take a look at O/R Broker
I would recommend the Quill query library. Here is an introduction post by Li Haoyi to get started.
TL;DR
{
  import io.getquill._
  import com.zaxxer.hikari.{HikariConfig, HikariDataSource}

  val pgDataSource = new org.postgresql.ds.PGSimpleDataSource()
  pgDataSource.setUser("postgres")
  pgDataSource.setPassword("example")

  val config = new HikariConfig()
  config.setDataSource(pgDataSource)

  val ctx = new PostgresJdbcContext(LowerCase, new HikariDataSource(config))
  import ctx._
}
Define a case class that maps to the table:
// mapping `city` table
case class City(
  id: Int,
  name: String,
  countryCode: String,
  district: String,
  population: Int
)
and query all items:
# ctx.run(query[City])
cmd11.sc:1: SELECT x.id, x.name, x.countrycode, x.district, x.population FROM city x
val res11 = ctx.run(query[City])
^
res11: List[City] = List(
City(1, "Kabul", "AFG", "Kabol", 1780000),
City(2, "Qandahar", "AFG", "Qandahar", 237500),
City(3, "Herat", "AFG", "Herat", 186800),
...
ScalikeJDBC is quite easy to use. It allows you to write raw SQL using interpolated strings.
http://scalikejdbc.org/
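A small sketch of what that looks like against PostgreSQL (the connection details and the users table are illustrative assumptions):

import scalikejdbc._

// Register a connection pool once at startup
Class.forName("org.postgresql.Driver")
ConnectionPool.singleton("jdbc:postgresql://localhost:5432/DB_NAME", "DB_USER", "DB_PASSWORD")

implicit val session = AutoSession

// SELECT via the sql interpolator; interpolated values become bind parameters
val names: List[String] =
  sql"SELECT name FROM users".map(_.string("name")).list.apply()

// UPDATE works the same way
val newName = "Alice"
val id = 1
sql"UPDATE users SET name = $newName WHERE id = $id".update.apply()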