How to use cedilla in Hortonworks - Scala

I have a framework written in Scala which reads a JSON file and processes it based on the input in the JSON.
val fileContent = scala.io.Source.fromFile("filename")
val jsonText = fileContent.getLines.mkString("\n")
val parseJson = JSON.parseFull(jsonText).get.asInstanceOf[Map[String, Any]]
The input JSON has a section to extract a file and place it in a path, with the cedilla character as the file delimiter.
The JSON is parsed correctly on MapR every time. However, it fails intermittently on Hortonworks.
Can anyone help me figure out where this could be failing?
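One thing worth checking (a rough sketch, not a confirmed fix): intermittent failures on one cluster but not another are often caused by differing default JVM file encodings, which matters when the delimiter is a non-ASCII character such as the cedilla (ç). Reading the file with an explicit codec removes that variable:
import scala.io.{Codec, Source}
import scala.util.parsing.json.JSON

// Read the JSON with an explicit UTF-8 codec instead of relying on the
// cluster's default file.encoding (adjust the charset to whatever the
// file is actually written in).
val fileContent = Source.fromFile("filename")(Codec.UTF8)
val jsonText = fileContent.getLines().mkString("\n")
val parseJson = JSON.parseFull(jsonText).get.asInstanceOf[Map[String, Any]]
fileContent.close()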

Related

Load CSV file as dataframe from resources within an Uber Jar

So, I made a Scala application to run in Spark and created the uber JAR using sbt assembly.
The file I load is a lookup needed by the application, so the idea is to package it together. It works fine from within IntelliJ using the path "src/main/resources/lookup01.csv".
I am developing on Windows and testing locally before deploying to a remote test server.
But when I call spark-submit on the Windows machine, I get the error:
"org.apache.spark.sql.AnalysisException: Path does not exist: file:/H:/dev/Spark/spark-2.4.3-bin-hadoop2.7/bin/src/main/resources/"
It seems it tries to find the file in the Spark home location instead of inside the JAR file.
How can I express the path so that the file is looked up from within the JAR package?
Example code of the way I load the DataFrame. After loading it, I transform it into other structures like Maps.
val v_lookup = sparkSession.read.option("header", true).csv("src/main/resources/lookup01.csv")
What I would like to achieve is a way to express the path so that it works in every environment where I run the JAR, ideally also from within IntelliJ while developing.
Edit: Scala version is 2.11.12
Update:
It seems that to get hold of the file inside the JAR, I have to read it as a stream. The code below worked, but I can't figure out a reliable way to extract the file's headers the way SparkSession.read.option does.
val fileStream = scala.io.Source.getClass.getResourceAsStream("/lookup01.csv")
val inputDF = sparkSession.sparkContext.makeRDD(scala.io.Source.fromInputStream(fileStream).getLines().toList).toDF
When makeRDD is applied, I get the RDD and can then convert it to a DataFrame, but it seems I lose the ability to use the "read" option that parsed out the headers as the schema.
Is there any way around this when using makeRDD?
Another problem with this is that it seems I will have to parse the lines into columns manually.
You have to get the correct path from the classpath.
Considering that your file is under src/main/resources:
val path = getClass.getResource("/lookup01.csv")
val v_lookup = sparkSession.read.option("header", true).csv(path.toString)
So, it all points to the fact that once the file is inside the JAR, it can only be accessed as an input stream, reading the chunk of data from within the compressed file.
I arrived at a solution; even though it's not pretty, it does what I need: read a CSV file, take the first 2 columns, turn them into a DataFrame, and then load them into a key-value structure (in this case I created a case class to hold these pairs).
I am considering migrating these lookups to a HOCON file, which may make the loading process less convoluted.
import sparkSession.implicits._

// KeyValue is the case class mentioned above, holding one lookup pair.
val fileStream = scala.io.Source.getClass.getResourceAsStream("/lookup01.csv")
val input = sparkSession.sparkContext
  .makeRDD(scala.io.Source.fromInputStream(fileStream).getLines().toList)
  .toDF()

// Split each single-column row into CSV fields and keep the first two.
val myRdd = input.map { line =>
  val col = utils.Utils.splitCSVString(line.getString(0))
  KeyValue(col(0), col(1))
}

// Collect the pairs to the driver as a Map for fast lookups.
val myDF = myRdd.rdd.map(x => (x.key, x.value)).collectAsMap()
fileStream.close()
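As a possible way to keep the header handling (a sketch, assuming Spark 2.2 or later, where DataFrameReader.csv accepts a Dataset[String]): wrap the lines read from the resource stream in a Dataset and hand it to the regular CSV reader, which will then honour the header option:
import sparkSession.implicits._

// Read the resource from the classpath/JAR as plain lines.
val stream = getClass.getResourceAsStream("/lookup01.csv")
val lines = scala.io.Source.fromInputStream(stream).getLines().toList
stream.close()

// Feed the lines to the CSV reader so the header is parsed as the schema.
val lookupDF = sparkSession.read
  .option("header", "true")
  .csv(sparkSession.createDataset(lines))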

Read a fastparquet file using Akka parquet

One of our Python systems generates Parquet files using Pandas and fastparquet. These are to be read by a Scala system that runs atop Akka Streams.
Akka does provide a source for reading Avro Parquet files. However, when I try to read the file, I end up with
java.lang.IllegalArgumentException: INT96 not yet implemented.
The INT96 column is one that does not need to be read for the Scala application to work. My question is whether I can specify a schema and leave that one column out, considering that the generated file comes from fastparquet.
The relevant snippet which generates a source for reading Parquet files is:
.map(result => {
  val path = s"s3a://${result.bucketName}/${result.key}"
  val file = HadoopInputFile.fromPath(new Path(path), hadoopConfig)
  val reader: ParquetReader[GenericRecord] =
    AvroParquetReader
      .builder[GenericRecord](file)
      .withConf(hadoopConfig)
      .build()
  AvroParquetSource(reader)
})
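One approach that may help (a sketch, not verified against fastparquet output; the column name below is made up): set a requested projection on the Hadoop configuration before building the reader, so parquet-avro only materialises the columns you list and never touches the INT96 one:
import org.apache.avro.SchemaBuilder
import org.apache.parquet.avro.AvroReadSupport

// Build an Avro schema containing only the columns the Scala side needs.
// "payload" is a placeholder; use your real column name(s).
val projection = SchemaBuilder.record("projection").fields()
  .optionalString("payload")
  .endRecord()

// Register the projection on hadoopConfig before AvroParquetReader.builder(...).withConf(hadoopConfig) runs.
AvroReadSupport.setRequestedProjection(hadoopConfig, projection)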

Scala Spark and Twitter feed

I am following some code that connects to Twitter and then writes that data out to a local text file. Here is my code:
System.setProperty("twitter4j.oauth.consumerKey", "Mycode - Not going to put real one in for obvious reasons")
System.setProperty("twitter4j.oauth.consumerSecret", "Mycode")
System.setProperty("twitter4j.oauth.accessToken", "Mycode")
System.setProperty("twitter4j.oauth.accessTokenSecret", "Mycode")
val ssc = new StreamingContext(spark.sparkContext, Seconds(5))
val twitterStream = TwitterUtils.createStream(ssc, None)
twitterStream.saveAsTextFiles("streamouts/tweets", "txt")
ssc.start()
Thread.sleep(30000)
ssc.stop(false)
Now, the code is not complaining about any missing references or anything. I believe I have the correct SBT dependencies.
The code above seems to run. It creates the folder structure and the text files within. However, ALL of the text files are completely blank: 0 KB in size.
What am I doing wrong? Does anyone have any ideas as to why it looks like it is creating the output text files but not actually writing into them?
By the way:
I have triple-checked the consumer keys, access tokens, etc. from the Twitter app. I'm certain I have copied them over correctly.
Conor
The code looks fine in your case.
why it looks like it is creating the output text files, but not actually writing into the files?
As per the documentation for new StreamingContext(spark.sparkContext, Seconds(5)):
For every 5-second interval, it collects the data that arrived in that window and creates an RDD, so each RDD is written out with the prefix and suffix that you pass to saveAsTextFiles.
So your files may be empty when the RDD for that interval is empty; otherwise, look at the files generated inside the folder: part-00000, part-00001, part-00002 should contain the data, not _SUCCESS or .part-00000.crc.
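If you would rather not produce empty output at all, one option (a rough sketch, with a made-up output path) is to switch from saveAsTextFiles to foreachRDD and skip the batches that carry no tweets:
// Write only the non-empty batches; each batch gets its own timestamped folder.
twitterStream
  .map(status => status.getText)
  .foreachRDD { (rdd, time) =>
    if (!rdd.isEmpty()) {
      rdd.saveAsTextFile(s"streamouts/tweets-${time.milliseconds}")
    }
  }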
I hope this helps you.

Loading a trained Word2Vec model in Spark

I am trying to load Google's pre-trained word2vec vectors 'GoogleNews-vectors-negative300.bin.gz' into Spark.
I converted the bin file to txt and created a smaller chunk for testing, which I called 'vectors.txt'. I tried loading it as follows:
val sparkSession = SparkSession.builder
  .master("local[*]")
  .appName("Word2VecExample")
  .getOrCreate()

val model2 = Word2VecModel.load(sparkSession.sparkContext, "src/main/resources/vectors.txt")
val synonyms = model2.findSynonyms("the", 5)

for ((synonym, cosineSimilarity) <- synonyms) {
  println(s"$synonym $cosineSimilarity")
}
and to my surprise I am faced with the following error:
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/elievex/Repository/ARCANA/src/main/resources/vectors.txt/metadata
I'm not sure where the 'metadata' after 'vectors.txt' came from.
I am using Spark, Scala and Scala IDE for Eclipse.
What am I doing wrong? Is there a different way to load a pre-trained model in Spark? I would appreciate any tips.
How exactly did you get vectors.txt? If you read the Javadoc for Word2VecModel.save, you will see that:
This saves:
- human-readable (JSON) model metadata to path/metadata/
- Parquet formatted data to path/data/
The model may be loaded using Loader.load.
So what you need is a model in Parquet format, which is the standard for Spark ML models.
Unfortunately, loading from Google's native format has not been implemented yet (see SPARK-9484).
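As a stopgap, here is a rough sketch of how the text export could be loaded by hand (this assumes the Map-based Word2VecModel constructor available in recent Spark MLlib versions, and a plain space-separated file with one word per line and no header):
import org.apache.spark.mllib.feature.Word2VecModel

// Parse each "word v1 v2 ... v300" line into (word, vector).
val wordVectors: Map[String, Array[Float]] =
  scala.io.Source.fromFile("src/main/resources/vectors.txt").getLines()
    .map { line =>
      val parts = line.split(" ")
      parts.head -> parts.tail.map(_.toFloat)
    }.toMap

// Build the model directly from the map and query it as usual.
val model2 = new Word2VecModel(wordVectors)
val synonyms = model2.findSynonyms("the", 5)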

Spark/Scala Opening Zipped CSV Files

I am new to Spark and Scala. We have ad event log files formatted as CSVs and then compressed using PKZIP. I have seen many examples of how to decompress zipped files using Java, but how would I do this in Scala for Spark? Ultimately, we want to get, extract, and load the data from each incoming file into an HBase destination table. Maybe this can be done with HadoopRDD? After this, we are going to introduce Spark Streaming to watch for these files.
Thanks,
Ben
Default compression support
@samthebest's answer is correct if you are using a compression format that is available in Spark (Hadoop) by default. These are:
bzip2
gzip
lz4
snappy
I have explained this topic deeper in my other answer: https://stackoverflow.com/a/45958182/1549135
Reading zip
However, if you are trying to read a zip file, you need to create a custom solution. One is mentioned in the answer I have already provided.
If you need to read multiple files from your archive, you might be interested in the answer I have provided: https://stackoverflow.com/a/45958458/1549135
Basically, it always comes down to using sc.binaryFiles and then decompressing the PortableDataStream, as in the sample:
import java.io.{BufferedReader, InputStreamReader}
import java.util.zip.ZipInputStream
import org.apache.spark.input.PortableDataStream

// Read the raw zip bytes, then stream out every line of every entry in each archive.
val lines = sc.binaryFiles(path, minPartitions)
  .flatMap { case (name: String, content: PortableDataStream) =>
    val zis = new ZipInputStream(content.open)
    Stream.continually(zis.getNextEntry)
      .takeWhile(_ != null)
      .flatMap { _ =>
        val br = new BufferedReader(new InputStreamReader(zis))
        Stream.continually(br.readLine()).takeWhile(_ != null)
      }
  }
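Building on that, a rough sketch of the next step towards the HBase load (the column names are made up, and it assumes a SparkSession named spark is in scope alongside sc): split the decompressed CSV lines into columns and turn them into a DataFrame:
import spark.implicits._

// `lines` is the RDD[String] produced by the snippet above.
val eventsDF = lines
  .map(_.split(",", -1))
  .map(cols => (cols(0), cols(1), cols(2)))
  .toDF("event_time", "ad_id", "impression_id")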
In Spark, provided your files have the correct filename suffix (e.g. .gz for gzipped) and the format is supported by org.apache.hadoop.io.compress.CompressionCodecFactory, then you can just use
sc.textFile(path)
UPDATE: At the time of writing, there is a bug in the Hadoop bzip2 library, which means that trying to read bzip2 files using Spark results in weird exceptions, usually ArrayIndexOutOfBounds.