flink what is the equivalent to parseQuotedStrings in scala api - scala

I am trying to convert this Java code to Scala:
DataSet<Tuple3<Long, String, String>> lines = env.readCsvFile("movies.csv")
.ignoreFirstLine()
.parseQuotedStrings('"')
.ignoreInvalidLines()
.types(Long.class, String.class, String.class);
I couldn't find any alternative to parseQuotedStrings in the Scala API. I would appreciate any assistance here.

The following code uses Flink's Java API; it is a literal translation of the code you provided:
import org.apache.flink.api.java._
val env = ExecutionEnvironment.getExecutionEnvironment
val movies = env.readCsvFile("movies.csv")
.ignoreFirstLine()
.parseQuotedStrings('"')
.ignoreInvalidLines()
.types(classOf[Long], classOf[String], classOf[String])
You can also use Flink's Scala API, something like this:
import org.apache.flink.api.scala._
val env = ExecutionEnvironment.getExecutionEnvironment
val movies = env.readCsvFile[(Long, String, String)](
  "movies.csv",
  ignoreFirstLine = true,
  quoteCharacter = '"',
  lenient = true)
AFAIK the Scala API does not have the fluent API of the Java version. The "lenient" option is the same as "ignoreInvalidLines", and the other options should be self-explanatory.
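For completeness, the Scala readCsvFile can also target a case class instead of a tuple; here is a minimal sketch (the field names below are hypothetical):
import org.apache.flink.api.scala._

case class Movie(id: Long, title: String, genres: String)

val env = ExecutionEnvironment.getExecutionEnvironment
// Quoted fields are parsed with '"', the header line is skipped, and invalid lines are ignored.
val movies = env.readCsvFile[Movie](
  "movies.csv",
  ignoreFirstLine = true,
  quoteCharacter = '"',
  lenient = true)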

Related

Overloaded method value json with alternatives: (jsonRDD: org.apache.spark.rdd.RDD[String]) using Spark in IntelliJ

I am trying to convert a JSON string jsonStr into a Spark DataFrame in Scala, using IntelliJ for this purpose.
val spark = SparkSession.builder().appName("SparkExample").master("local[*]").getOrCreate()
val sc = spark.sparkContext
import spark.implicits._
var df = spark.read.json(Seq(jsonStr).toDS)
df.show()
I am getting the following error while compiling/building the project with Maven:
Error:(243, 29) overloaded method value json with alternatives:
  (jsonRDD: org.apache.spark.rdd.RDD[String])org.apache.spark.sql.DataFrame
  (jsonRDD: org.apache.spark.api.java.JavaRDD[String])org.apache.spark.sql.DataFrame
  (paths: String*)org.apache.spark.sql.DataFrame
  (path: String)org.apache.spark.sql.DataFrame
cannot be applied to (org.apache.spark.sql.Dataset[String])
var df = spark.read.json(Seq(jsonStr).toDS)
Note: I was not getting the error while building with SBT.
The method below was introduced in Spark 2.2.0:
def json(jsonDataset: Dataset[String]): DataFrame =
Please correct your Spark version in Maven's pom.xml file.
Alternatively, change your code to:
val rdd = sc.parallelize(Seq(jsonStr))
val df = spark.read.json(rdd)
(or combine it into one expression, up to you). Note that spark.read.json(rdd) already returns a DataFrame, so no extra .toDS conversion is needed.
You're trying to pass a Dataset into the spark.read.json function, hence the error.
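For reference, once you are on Spark 2.2.0 or later, the Dataset-based call compiles as-is; here is a minimal self-contained sketch (the JSON payload is hypothetical):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SparkExample").master("local[*]").getOrCreate()
import spark.implicits._

val jsonStr = """{"name": "example", "value": 1}""" // hypothetical payload
val df = spark.read.json(Seq(jsonStr).toDS()) // json(Dataset[String]) overload, Spark 2.2.0+
df.show()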

Can we create a xml file with specific node with Spark Scala?

I have another question about Spark and Scala. I want to use those technologies to fetch data and generate an XML file.
Therefore, I want to know whether it is possible to create the nodes ourselves (not automatic creation) and which library we can use. I searched but found nothing very interesting (as I'm new to this technology, I don't know many keywords).
I want to know whether Spark has something like the code below (I wrote it in Scala; it works locally, but I can't use new File() in Spark).
import java.io.File
import javax.xml.parsers.{DocumentBuilder, DocumentBuilderFactory}
import javax.xml.transform.{Transformer, TransformerFactory}
import javax.xml.transform.dom.DOMSource
import javax.xml.transform.stream.StreamResult
import org.w3c.dom.Element

val docBuilder: DocumentBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
val document = docBuilder.newDocument()
var root: Element = document.createElement("rootTagName")
var attr = document.createAttribute("attr1")
attr.setValue("value of attr1")
root.setAttributeNode(attr)
attr = document.createAttribute("attr2")
attr.setValue("value of attr2")
root.setAttributeNode(attr)
document.appendChild(root)
document.setXmlStandalone(true)
var transformerFactory: TransformerFactory = TransformerFactory.newInstance()
var transformer: Transformer = transformerFactory.newTransformer()
var domSource: DOMSource = new DOMSource(document)
var streamResult: StreamResult = new StreamResult(new File(destination))
transformer.transform(domSource, streamResult)
I want to know whether it's possible to do that with Spark.
Thanks for your answer and have a good day.
Not exactly, but you can do something similar by using the Spark XML API or the XStream API with Spark.
First, try the Spark XML API, which is most useful for reading and writing XML files with Spark. However, at the time of writing, Spark XML has the following limitations:
1) Adding attributes to the root element is not supported.
2) The following structure, where you have header and footer elements, is not supported:
<parent>
<header></header>
<dataset>
<data attr="1">supports xml tags and data here</data>
<data attr="2">value2</data>
</dataset>
<footer></footer>
</parent>
If you have one root element followed by the data, then Spark XML is the go-to API.
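For that simple case, a rough write sketch with spark-xml (assuming the com.databricks:spark-xml package is on the classpath; the tag names and sample data below are hypothetical) could look like this:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("xml-sketch").getOrCreate()
import spark.implicits._

val df = Seq(("value1", 1), ("value2", 2)).toDF("value", "id") // hypothetical data

df.write
  .format("com.databricks.spark.xml")
  .option("rootTag", "dataset") // single root element wrapping all rows
  .option("rowTag", "data")     // one <data> element per row
  .save("output.xml")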
Alternatively, you can look at the XStream API. Below are the steps for using it to create custom XML structures.
1) First, create a Scala class matching the structure you want in the XML.
case class XMLData(name:String, value:String, attr:String)
2) Create an instance of this class
val data = XMLData("bookName","AnyValue", "AttributeValue")
3) Convert the data object to XML using the XStream API. If you already have the data in a DataFrame, do a map transformation to convert each row to an XML string and store it back in a DataFrame; if you do so, you can skip step #4.
val xstream = new XStream(new DomDriver)
val xmlString = xstream.toXML(data)
4) Now convert xmlString to a DataFrame (this needs import spark.implicits._ in scope)
val df = Seq(xmlString).toDF()
5) Finally, write to a file
df.write.text("file://filename")
Here is a full sample example with the XStream API:
import com.thoughtworks.xstream.XStream
import com.thoughtworks.xstream.io.xml.DomDriver
import org.apache.spark.sql.SparkSession

case class Animal(cri: String, taille: Int)

object SparkXMLUsingXStream {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder.master("local[*]")
      .appName("sparkbyexamples.com")
      .getOrCreate()

    val animal: Animal = Animal("Rugissement", 150)

    val xstream1 = new XStream(new DomDriver())
    xstream1.alias("testAni", classOf[Animal])
    xstream1.aliasField("cricri", classOf[Animal], "cri")

    val xmlString = Seq(xstream1.toXML(animal))

    import spark.implicits._
    val newDf = xmlString.toDF()
    newDf.show(false)
  }
}
Hope this helps !!
Thanks

Avro support in Flink - scala

How do I read Avro from Flink in Scala?
Is it the same for batch/stream/table: StreamExecutionEnvironment / ExecutionEnvironment / TableEnvironment?
Would it be something like: val custTS: TableSource = new AvroInputFormat("/path/to/file", ...)?
Below is the Java Avro implementation reference (connectors), but I can't find a Scala reference anywhere:
AvroInputFormat<User> users = new AvroInputFormat<User>(in, User.class);
DataSet<User> usersDS = env.createInput(users);
You can use Flink's InputFormats, including the AvroInputFormat, from the Java as well as the Scala API:
Streaming & batch: val avroInputStream = env.createInput(new AvroInputFormat[User](in, classOf[User]))
Table API: tableEnv.registerTable("table", avroInputStream.toTable(tableEnv))
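For the batch case specifically, here is a minimal Scala sketch (assuming the flink-avro dependency is on the classpath and User is an Avro-generated class; depending on the Flink version, AvroInputFormat lives in org.apache.flink.formats.avro or org.apache.flink.api.java.io):
import org.apache.flink.api.scala._
import org.apache.flink.core.fs.Path
import org.apache.flink.formats.avro.AvroInputFormat

val env = ExecutionEnvironment.getExecutionEnvironment
val in = new Path("/path/to/users.avro") // hypothetical input path
val usersFormat = new AvroInputFormat[User](in, classOf[User])
val usersDS: DataSet[User] = env.createInput(usersFormat)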

loading csv file to HBase through Spark

This is a simple "how to" question:
We can bring data into the Spark environment through com.databricks.spark.csv. I know how to create an HBase table through Spark and how to write data to HBase tables manually. But is it even possible to load text/CSV/JSON files directly into HBase through Spark? I can't see anybody talking about it, so I'm just checking. If possible, please guide me to a good website that explains the Scala code in detail to get it done.
Thank you,
There are multiple ways you can do that.
Spark HBase connector (SHC):
https://github.com/hortonworks-spark/shc
You can see a lot of examples at that link.
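As a rough orientation, writing a DataFrame through SHC follows the catalog pattern from that repository's README; the sketch below is adapted from those examples (the table name, column mapping, and CSV path are hypothetical, and the format string may differ by version):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

val spark = SparkSession.builder().appName("csv-to-hbase").getOrCreate()
val df = spark.read.option("header", "true").csv("payments.csv") // hypothetical CSV

// Catalog mapping DataFrame columns to an HBase row key and a "cf" column family.
val catalog = s"""{
  |"table":{"namespace":"default", "name":"payments"},
  |"rowkey":"key",
  |"columns":{
    |"PaymentNumber":{"cf":"rowkey", "col":"key", "type":"string"},
    |"PaymentDate":{"cf":"cf", "col":"PaymentDate", "type":"string"},
    |"Amount":{"cf":"cf", "col":"Amount", "type":"string"}
  |}
|}""".stripMargin

df.write
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "5"))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save()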
You can also use Spark core to load the data into HBase using HBaseConfiguration.
Code Example:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.io.{LongWritable, Text}

val fileRDD = sc.textFile(args(0), 2)
val transformedRDD = fileRDD.map { line => convertToKeyValuePairs(line) }

val conf = HBaseConfiguration.create()
conf.set(TableOutputFormat.OUTPUT_TABLE, "tableName")
conf.set("hbase.zookeeper.quorum", "localhost:2181")
conf.set("hbase.master", "localhost:60000")
conf.set("fs.default.name", "hdfs://localhost:8020")
conf.set("hbase.rootdir", "/hbase")

val jobConf = new Configuration(conf)
jobConf.set("mapreduce.job.output.key.class", classOf[Text].getName)
jobConf.set("mapreduce.job.output.value.class", classOf[LongWritable].getName)
jobConf.set("mapreduce.outputformat.class", classOf[TableOutputFormat[Text]].getName)

transformedRDD.saveAsNewAPIHadoopDataset(jobConf)

// Splits a pipe-delimited line and turns it into an HBase Put keyed by the payment number.
def convertToKeyValuePairs(line: String): (ImmutableBytesWritable, Put) = {
  val fields = line.split("\\|")
  val cfDataBytes = Bytes.toBytes("cf")
  val rowkey = Bytes.toBytes(fields(1))
  val put = new Put(rowkey)
  put.add(cfDataBytes, Bytes.toBytes("PaymentDate"), Bytes.toBytes(fields(0)))
  put.add(cfDataBytes, Bytes.toBytes("PaymentNumber"), Bytes.toBytes(fields(1)))
  put.add(cfDataBytes, Bytes.toBytes("VendorName"), Bytes.toBytes(fields(2)))
  put.add(cfDataBytes, Bytes.toBytes("Category"), Bytes.toBytes(fields(3)))
  put.add(cfDataBytes, Bytes.toBytes("Amount"), Bytes.toBytes(fields(4)))
  (new ImmutableBytesWritable(rowkey), put)
}
You can also use this one:
https://github.com/nerdammer/spark-hbase-connector
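The nerdammer connector exposes an RDD-based DSL; the sketch below is a rough adaptation of that project's README (the package and method names come from its documentation and may change between versions; the CSV layout is hypothetical):
import it.nerdammer.spark.hbase._

// Parse a comma-separated file into (rowKey, col1, col2) tuples and save them to HBase.
val records = sc.textFile("data.csv")
  .map(_.split(","))
  .map(cols => (cols(0), cols(1), cols(2)))

records.toHBaseTable("tableName")
  .toColumns("column1", "column2")
  .inColumnFamily("cf")
  .save()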

How do I set Jackson parser features when using json4s?

I am receiving the following error while attempting to parse JSON with json4s:
Non-standard token 'NaN': enable JsonParser.Feature.ALLOW_NON_NUMERIC_NUMBERS to allow
How do I enable this feature?
Assuming your ObjectMapper object is named mapper:
val mapper = new ObjectMapper()
// Configure NaN here
mapper.configure(JsonParser.Feature.ALLOW_NON_NUMERIC_NUMBERS, true)
...
val json = ... //Get your json
val imported = mapper.readValue(json, classOf[Thing]) // Thing being whatever class you're importing to.
@Nathaniel Ford, thanks for setting me on the right path!
I ended up looking at the source code for the parse() method (which is what I should have done in the first place). This works:
import com.fasterxml.jackson.core.JsonParser
import com.fasterxml.jackson.databind.ObjectMapper
import org.json4s._
import org.json4s.jackson.Json4sScalaModule
val jsonString = """{"price": NaN}"""
val mapper = new ObjectMapper()
// Configure NaN here
mapper.configure(JsonParser.Feature.ALLOW_NON_NUMERIC_NUMBERS, true)
mapper.registerModule(new Json4sScalaModule)
val json = mapper.readValue(jsonString, classOf[JValue])
While the answers above are still correct, it should be added that since Jackson 2.10, JsonParser.Feature.ALLOW_NON_NUMERIC_NUMBERS is deprecated.
The sustainable way to configure NaN handling is the following:
val mapper = JsonMapper.builder().enable(JsonReadFeature.ALLOW_NON_NUMERIC_NUMBERS).build();
// now your parsing
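Putting it together with json4s, a self-contained sketch for Jackson 2.10+ (assuming json4s-jackson is on the classpath) could be:
import com.fasterxml.jackson.core.json.JsonReadFeature
import com.fasterxml.jackson.databind.json.JsonMapper
import org.json4s._
import org.json4s.jackson.Json4sScalaModule

val mapper = JsonMapper.builder()
  .enable(JsonReadFeature.ALLOW_NON_NUMERIC_NUMBERS) // replaces the deprecated JsonParser feature
  .addModule(new Json4sScalaModule)
  .build()

val json: JValue = mapper.readValue("""{"price": NaN}""", classOf[JValue])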