So people, myself included, have been having problems compressing the output of Scalding jobs. After googling I get the odd whiff of an answer in some obscure forum, but nothing suitable for people's copy-and-paste needs.
I would like an output like Tsv, but one that writes compressed output.
Anyway, after much faffification I managed to write a TsvCompressed output which seems to do the job (you still need to set the Hadoop job configuration properties, i.e. set compress to true and set the codec to something sensible, or it defaults to crappy deflate).
import com.twitter.scalding._
import cascading.tuple.Fields
import cascading.scheme.local
import cascading.scheme.hadoop.{TextLine, TextDelimited}
import cascading.scheme.Scheme
import org.apache.hadoop.mapred.{OutputCollector, RecordReader, JobConf}
case class TsvCompressed(p: String) extends FixedPathSource(p) with DelimitedSchemeCompressed

trait DelimitedSchemeCompressed extends Source {
  val types: Array[Class[_]] = null

  override def localScheme = new local.TextDelimited(Fields.ALL, false, false, "\t", types)

  override def hdfsScheme = {
    val temp = new TextDelimited(Fields.ALL, false, false, "\t", types)
    temp.setSinkCompression(TextLine.Compress.ENABLE)
    temp.asInstanceOf[Scheme[JobConf, RecordReader[_, _], OutputCollector[_, _], _, _]]
  }
}
I also have a small project showing how to achieve compressed output from Tsv: WordCount-Compressed.
Scalding was setting null for the Cascading TextDelimited parameter, which disables compression.
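For reference, here is a minimal sketch of the job-level settings mentioned above. The property names are the old mapred ones and GzipCodec is just one sensible choice; in practice you would usually pass these as -D options or through however your job builds its Hadoop configuration.

import org.apache.hadoop.mapred.JobConf

// Sketch only: enable output compression and pick a codec explicitly,
// otherwise Hadoop falls back to DefaultCodec (deflate).
val conf = new JobConf()
conf.setBoolean("mapred.output.compress", true)
conf.set("mapred.output.compression.codec",
  "org.apache.hadoop.io.compress.GzipCodec")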
Is it possible to include another conf file in a *.conf file?
Current implementation:
// db-writer.conf
writer: {
  name="DatabaseWriter",
  model="model1",
  table-name="model1",
  append=false,
  create-table-file="sql/create_table_model1.sql",
  source-file="abcd.csv"
}
Desired solution:
// model1.conf + others model2.conf, model3.conf..
table: {
  name="model1",
  table-name="model1",
  create-table-file="../sql/create_table_model1.sql"
}
// db-writer.conf
import model1.conf <=== some import?
writer: {
  name="DatabaseWriter",
  model="model1", <=== some reference like this?
  append=false,
  source-file="abcd.csv"
}
The reason I would like to have it like this is:
to reduce duplicated definitions
to pre-define user conf files which are rarely modified
I guess it is not possible; if not, do you have any suggestions for how to separate configs and reuse them?
I'm using Scala 2.12 and pureconfig 0.14 (can be updated to anything newer).
Pureconfig uses HOCON (though its interpretation of some things, like durations, differs). HOCON include is supported.
So assuming that you have model1.conf in your resources (e.g. src/main/resources), all you need in db-writer.conf is
include "model1"
HOCON-style overrides and concatenation are also supported:
writer: ${table} {
  name = "DatabaseWriter"
  model = "model1"
  append = false
  source-file = "abcd"
}
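For completeness, here is a minimal sketch of loading the merged result with pureconfig. The case class names are hypothetical, and it assumes the pureconfig-generic module is on the classpath (pureconfig maps kebab-case keys to camelCase fields by default):

import pureconfig._
import pureconfig.generic.auto._

// Hypothetical case classes mirroring the merged writer block above.
case class WriterConf(
  name: String,
  model: String,
  tableName: String,       // table-name, inherited from model1.conf
  append: Boolean,
  createTableFile: String, // create-table-file, inherited from model1.conf
  sourceFile: String
)
case class AppConf(writer: WriterConf)

// db-writer.conf (starting with `include "model1"`) is read from the classpath.
val conf = ConfigSource.resources("db-writer.conf").loadOrThrow[AppConf]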
I want to modify or extend the CSV import process in Odoo.
I have some fields that are auto-calculated and others that are needed but are not in the CSV file.
I have searched the code and tried using ir.action.todo and ir.action.client, but it doesn't work.
Any ideas, using hooks or anything else?
Thanks,
Yoinier.
You just need to inherit the 'base_import.import' model
from odoo import api, models

class Import(models.TransientModel):
    _inherit = 'base_import.import'

    @api.model
    def _convert_import_data(self, fields, options):
        # Override base method
        # Called when the actual import starts
        data, import_fields = super(Import, self)._convert_import_data(fields, options)
        # Do something ...
        return data, import_fields

    def parse_preview(self, options, count=10):
        # Override base method
        # Called when the data is loaded for preview
        preview_data = super(Import, self).parse_preview(options, count=count)
        # Do something ...
        return preview_data
But overriding the base import method is probably not a good idea; I'd suggest using a custom import wizard to do your custom import.
The SparkSession.catalog object has a bunch of methods to interact with the metastore, namely:
['cacheTable',
'clearCache',
'createExternalTable',
'createTable',
'currentDatabase',
'dropGlobalTempView',
'dropTempView',
'isCached',
'listColumns',
'listDatabases',
'listFunctions',
'listTables',
'recoverPartitions',
'refreshByPath',
'refreshTable',
'registerFunction',
'setCurrentDatabase',
'uncacheTable']
Unfortunately, there seems to be no programmatic way to drop a table.
There are multiple ways to achieve this, like
spark.sql(f"drop table my_table")
or
spark._jsparkSession.sharedState().externalCatalog().dropTable(db, table, True, True)
but they look a little bit hackish compared to a simple, yet missing, dropTable method.
Is there a better way?
AFAIK, the approaches mentioned above are the most commonly used ones; I don't see any other way.
But there is an alternative I can see from the docs:
you might try org.apache.spark.sql.hive.HiveUtils, which has goodies (to drop tables) for you.
I am not so good at Python; you can see the Scala example below and follow the same approach for Python.
// Placing the helper inside this package gives access to the private[spark] HiveUtils API.
package org.apache.spark.sql.hive {

  import org.apache.spark.sql.hive.HiveUtils
  import org.apache.spark.SparkContext

  object utils {
    def dropTable(sc: SparkContext, dbName: String, tableName: String,
                  ignoreIfNotExists: Boolean, purge: Boolean): Unit = {
      HiveUtils
        .newClientForMetadata(sc.getConf, sc.hadoopConfiguration)
        .dropTable(dbName, tableName, ignoreIfNotExists, purge)
    }
  }
}
The caller would look like this:
import org.apache.spark.sql.hive.utils
utils.dropTable(sc, "default", "my_table", true, true)
I have an abstract base class that looks like this:
#' An Abstract Base Class
Filter <- setRefClass(
  Class = "Filter",
  methods = list(
    train = function(x) {
      "Override this method to train any associated parameters for the filter on the supplied data"
      print("no learning to be done")
    })
)
and the following class that extends this class:
#' Filter class that leverages the prcomp R method.
PcaFilter <- setRefClass(
  "PcaFilter",
  contains = "Filter",
  fields = list(
    d = "numeric",
    model = "ANY"
  ),
  methods = list(
    train = function(x) {
      "train PCA model, store result to model attribute of obj"
      model <<- prcomp(x)
    })
)
After I run
roxygen2::roxygenize()
I get two man files, but in the man file for the second class the docstring for the overriding method does not come through; I get the docstring from the base class instead. Am I doing something wrong, or is this a bug in roxygen2?
Also, is there a better way of doing this? I would like to be able to use multi-line docstrings.
Having searched through the issues on the roxygen GitHub repo, I found that there's already an active issue pertaining to this:
https://github.com/klutometis/roxygen/issues/433
In summary: support and documentation for Reference Classes in roxygen is not great as of v5.0. The suggested method is still to use docstrings, and it is impossible to override the docstrings of parents.
One question I have about current Scala CouchDB drivers is whether they can work with "partial" schemas. I'll try to explain what I mean: the libraries I've seen seem to all want to do a complete conversion from JSON docs in the database to a Scala object, handle the Scala object, and convert it back to JSON. This is fine if your application knows everything about that type of object, especially if it is the sole piece of software interacting with that database. However, what if I want to write a little application that only knows about part of the JSON object: for example, what if I'm only interested in a 'mybook' component embedded like this:
{
  _id: "0ea56a7ec317138700743cdb740f555a",
  _rev: "2-3e15c3acfc3936abf10ea4f84a0aeced",
  type: "user",
  profiles: {
    mybook: {
      key: "AGW45HWH",
      secret: "g4juh43ui9hg929gk4"
    },
    .. 6 or 7 other profiles
  },
  .. lots of other stuff
}
I really don't want to convert the whole JSON AST to a Scala object. On the other hand, in CouchDB you must save back the entire JSON doc, so this needs to be preserved somehow. I think what I really want is something like this:
class MyBook {
  private val userJson: JObject = ... // full JSON retrieved from the database
  lazy val _id: String = ...          // parsed from the JSON
  lazy val _rev: String = ...         // parsed from the JSON
  lazy val key: String = ...          // parsed from the JSON
  lazy val secret: String = ...       // (ditto)
  def withSecret(secret: String): MyBook = ... // new object with altered userJson
  def save(db: CouchDB) = ...         // save userJson back to couchdb
}
Advantages:
computationally cheaper to extract only needed fields
don't have to sync with database evolution except for the 'mybook' part
more suitable for development with partial schemas
safer, because there is less chance of inadvertently deleting fields if we haven't kept up with the database schema
Disadvantages:
domain objects in Scala are not pristinely independent of couch/JSON
more memory use per object
Is this possible with any of the current Scala drivers? With either scouchdb or the new Sohva library, it seems not.
As long as you have a good JSON library and a good HTTP client library, implementing a schemaless CouchDB client library is really easy.
Here is an example in Java: code, tests.
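To illustrate the point in Scala, here is a rough, hypothetical sketch assuming a CouchDB instance at localhost:5984, a database called users, and spray-json for the raw AST. The document is kept as an untyped JsObject, so fields the application doesn't know about survive the round trip:

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import spray.json._

val http = HttpClient.newHttpClient()
val url  = "http://localhost:5984/users/0ea56a7ec317138700743cdb740f555a"

// GET the document and keep it as a raw JSON object (no schema involved).
val doc: JsObject = http
  .send(HttpRequest.newBuilder(URI.create(url)).GET().build(),
        HttpResponse.BodyHandlers.ofString())
  .body()
  .parseJson
  .asJsObject

// Change one field and PUT the whole document back; _rev and everything else stay intact.
val updated = JsObject(doc.fields.updated("type", JsString("admin")))
http.send(
  HttpRequest.newBuilder(URI.create(url))
    .header("Content-Type", "application/json")
    .PUT(HttpRequest.BodyPublishers.ofString(updated.compactPrint))
    .build(),
  HttpResponse.BodyHandlers.ofString())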
My CouchDB library uses spray-json for (de)serialization, which is very flexible and would enable you to ignore parts of a document but still save it. Let's look at a simplified example.
Say we have a document like this:
{
  dontcare: {
    ...
  },
  important: "foo"
}
Then you could declare a class to hold information from this document and define how the conversion is done:
import spray.json._
import spray.json.DefaultJsonProtocol._

// Wrapper that keeps an arbitrary JSON subtree as-is.
case class Dummy(js: JsValue)
case class PartialDoc(dontcare: Dummy, important: String)

implicit object DummyFormat extends JsonFormat[Dummy] {
  override def read(js: JsValue): Dummy = Dummy(js)
  override def write(d: Dummy): JsValue = d.js
}

implicit val productFormat = jsonFormat2(PartialDoc)
This will ignore anything in dontcare but still save it as a raw JSON AST. Of course this example is not as complex as the one in your question, but it should give you an idea of how to solve your problem.
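A minimal round-trip sketch, assuming the formats above are in scope; the dontcare subtree survives untouched because it is stored as raw JSON:

val doc = """{ "dontcare": { "anything": 1 }, "important": "foo" }""".parseJson
val partial = doc.convertTo[PartialDoc]                // only "important" is really parsed
val restored = partial.copy(important = "bar").toJson  // still contains "dontcare"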