how to save a HashSet into a plain text file in Scala

For example, I have a HashSet and want to save it to a file such as .txt or .csv.
val slotidSet: util.HashSet[String] = new util.HashSet[String](1)
slotidSet.add("100")
slotidSet.add("105")
slotidSet.add("102")
slotidSet.add("103")
How to save this HashSet into a plain text file?
Thanks in advance.

Something like this should do the job:
import java.util
import java.nio.file.{Files, Paths}
import scala.collection.JavaConverters._ // provides .asScala (on Scala 2.13+, use scala.jdk.CollectionConverters._)

Files.write(Paths.get("file.txt"), slotidSet.asScala.mkString(",").getBytes)
Output is
100,102,103,105
Now it's just up to you to choose the format; in this case all the elements are simply concatenated with ,. Whenever you need to work with files in Scala/Java, Java NIO is a good place to start.
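If you would rather have one element per line, Files.write also has an overload that takes a collection of lines. A minimal sketch, again assuming the slotidSet above (the sort is optional, but HashSet iteration order is not guaranteed, so it gives a stable file):
import java.nio.file.{Files, Paths}
import scala.collection.JavaConverters._

// Each element becomes one UTF-8 line in the file; sorting gives a deterministic order.
Files.write(Paths.get("file.txt"), slotidSet.asScala.toSeq.sorted.asJava)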

Related

How to use an annotator in sparknlp for a text file

As I am a beginner in Spark NLP, I started doing some hands-on exercises using the functions shown in the johnsnowlabs documentation.
I am using Scala on Databricks, and I have a large text file from https://www.gutenberg.org/
So first I import the necessary libraries and the data as follows:
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val bookRDD = sc.textFile("/FileStore/tables/84_0-5b1ef.txt")
val words = bookRDD.filter(x => x.length > 0).flatMap(line => line.split("""\W+"""))
val rddD = words.toDF("text")
How do I use the different annotators available in johnsnowlabs for my purpose?
For example, if I want to find stop words, then I can use:
val stopWordsCleaner = new StopWordsCleaner()
.setInputCols("token")
.setOutputCol("cleanTokens")
.setStopWords(Array("this", "is", "and"))
.setCaseSensitive(false)
But I have no idea how to use this to find the stop words of my text file. Do I need to use a pretrained model with the annotator?
I found it very difficult to find a good tutorial about this, so I would be grateful if someone could provide some useful hints.
StopWordsCleaner is the annotator to use to remove stop words.
Refer: Annotators
Stop words may differ for your text depending on the context, but generally all NLP engines have a set of stop words that they match and remove.
In JSL spark-nlp, you may also set your own stop words using setStopWords while using StopWordsCleaner.
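To apply it to the DataFrame from the question, the annotator needs to run inside a pipeline that first produces the document and token columns it expects. A minimal sketch, assuming the rddD DataFrame built above (the stop-word list here is illustrative):
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.{Tokenizer, StopWordsCleaner}
import org.apache.spark.ml.Pipeline

// Turns the raw "text" column into annotated documents
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

// Splits documents into tokens, which StopWordsCleaner consumes
val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val stopWordsCleaner = new StopWordsCleaner()
  .setInputCols("token")
  .setOutputCol("cleanTokens")
  .setStopWords(Array("this", "is", "and"))
  .setCaseSensitive(false)

val pipeline = new Pipeline()
  .setStages(Array(documentAssembler, tokenizer, stopWordsCleaner))

// Fit and transform; the cleanTokens column holds the filtered tokens
val cleaned = pipeline.fit(rddD).transform(rddD)
cleaned.select("cleanTokens.result").show(truncate = false)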

Spark - Get from a directory with nested folders all filenames of a particular data type

I have a directory with some subfolders that contain different parquet files. Something like this:
2017-09-05
  10-00
    part00000.parquet
    part00001.parquet
  11-00
    part00000.parquet
    part00001.parquet
  12-00
    part00000.parquet
    part00001.parquet
What I want is, by passing the path to the directory 2017-09-05, to get a list of the names of all parquet files.
I was able to achieve it, but in a very inefficient way:
val allParquetFiles = sc.wholeTextFiles("C:/MyDocs/2017-09-05/*/*.parquet")
allParquetFiles.keys.foreach((k) => println("The path to the file is: "+k))
Each key is a name I am looking for, but this approach requires loading the content of every file as well, which I then can't use, since it comes back as binary (and I don't know how to convert it into a DataFrame).
Once I have the keys (so the list of filePaths) I am planning to invoke:
val myParquetDF = sqlContext.read.parquet(filePath);
As you may have already gathered, I am quite new to Spark. So if there is a faster or easier approach to reading a list of parquet files located in different folders, please let me know.
My partial solution: I wasn't able to get all the paths for all filenames in a folder, but I was able to get the content of all files of that type into the same dataframe, which was my ultimate goal. In case someone needs it in the future, I used the following line:
val df = sqlContext.read.parquet("C:/MyDocs/2017-09-05/*/*.parquet")
Thanks for your time
You can do it using the HDFS API, like this:
import org.apache.hadoop.fs._
import org.apache.hadoop.conf._
val fs = FileSystem.get(new Configuration())
// globStatus (unlike listStatus) expands the wildcards in the path
val files = fs.globStatus(new Path("C:/MyDocs/2017-09-05/*/*.parquet")).map(_.getPath.toString)
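Once you have the paths, you can hand them all to the parquet reader in one call. A minimal sketch, assuming the files array from above:
// DataFrameReader.parquet accepts multiple paths, so the whole
// list can be read into a single DataFrame at once.
val myParquetDF = sqlContext.read.parquet(files: _*)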
First, it is better to avoid wholeTextFiles: that method reads each whole file at once. Try the textFile method instead.
Second, if you need to get all files recursively from one directory, you can achieve it with the textFile method:
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
This configuration enables recursive search (it works for Spark jobs as well as for MapReduce jobs). Then just invoke sc.textFile(path).

export als recommendation model to a file

I am new to Apache Spark. I ran the sample ALS algorithm code from the examples folder, giving a csv file as input. When I use model.save(path) to save the model, it is stored as gz.parquet files.
When I tried to open these files directly, I got errors.
Now I want to store the generated recommendation model in a text or csv file so I can use it outside Spark.
I tried the following function to store the generated model in a file, but it did not work:
model.saveAsTextFile("path")
Please suggest a way to overcome this issue.
Let's say you have trained your model with something like this:
val model = ALS.train(ratings, rank, numIterations, 0.01)
All that you have to do is:
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.mllib.recommendation.Rating
// Save
model.save(sc, "yourpath/yourmodel")
// Load Model
val sameModel = MatrixFactorizationModel.load(sc, "yourpath/yourmodel")
As it turns out, saveAsTextFile() only writes from the workers. Use collect() to gather the data from the workers so it can be saved locally on the driver. The solution can be found here.
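If the goal is to use the factors outside Spark, one option is to collect the factor matrices and write them out yourself. A minimal sketch, assuming the model from the ALS.train call above (the file name is illustrative):
import java.io.PrintWriter

// userFeatures is an RDD[(Int, Array[Double])]; collect() brings it
// to the driver so it can be written to a local CSV file.
val writer = new PrintWriter("userFeatures.csv")
model.userFeatures.collect().foreach { case (id, features) =>
  writer.println(id + "," + features.mkString(","))
}
writer.close()
The same pattern works for model.productFeatures.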

Exporting and importing a MATLAB map structure

I'm using the containers.Map function to store my data.
Is there an easy way to export the whole structure to a file and be able to import it again at a later time?
A structure could be:
keys = {'six','seven','eight','nine'};
vals = {6,7,8,9};
Map = containers.Map(keys,vals);
and then say I want to export this structure and be able to import it at a later time (or in another script).
Thanks, regards
Rasmus
Use the save and load functions: http://www.mathworks.com/help/matlab/matlab_env/save-load-and-delete-workspace-variables.html
save('MyFileName', 'Map');  % writes Map to MyFileName.mat
load('MyFileName');         % restores Map into the workspace

import csv into sugarCRM

I am trying to import a csv file into SugarCRM, but on step 2 my data looks like: ;cqà,ý¼nÉBÏÛï÷£ýd$ÕÆóWHkÂQËrÅTyÀÁ
I have no idea whatsoever what to do. I've tried researching how to import, and I am just not seeing anything that addresses my problem.
Try setting your input file format to UTF-8 and see if that solves the problem. It sounds like a file-encoding issue...
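If the file is not already UTF-8, you can re-encode it before importing. A minimal sketch in Scala (the file names are placeholders, and the source encoding is a guess; Windows-1252 is a common culprit for Latin-script CSVs):
import java.nio.charset.{Charset, StandardCharsets}
import java.nio.file.{Files, Paths}

// Read the bytes in the (assumed) original encoding, then write them back out as UTF-8.
val bytes = Files.readAllBytes(Paths.get("contacts.csv"))
val text  = new String(bytes, Charset.forName("windows-1252"))
Files.write(Paths.get("contacts-utf8.csv"), text.getBytes(StandardCharsets.UTF_8))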