Is it possible to load pre-trained word2vec vectors into Spark? - scala

Is there a way to load Google's or GloVe's pre-trained vectors (models), such as GoogleNews-vectors-negative300.bin.gz, into Spark and perform operations such as findSynonyms that Spark provides? Or do I need to do the loading and the operations from scratch?
In the post Load Word2Vec model in Spark, Tom Lous suggests converting the bin file to txt and starting from there. I have already done that, but what comes next?
In a question I posted yesterday I got an answer saying that models in Parquet format can be loaded in Spark, so I'm posting this question to make sure there is no other option.

Disclaimer: I'm pretty new to Spark, but the approach below at least works for me.
The trick is figuring out how to construct a Word2VecModel from a set of word vectors, as well as handling some of the gotchas in trying to create the model this way.
First, load your word vectors into a Map. For example, I have saved my word vectors in Parquet format (in a folder called "wordvectors.parquet") where the "term" column holds the String word and the "vector" column holds the vector as an array[float], and I can load it like so in Java:
// Loads the dataset with the "term" column holding the word and the "vector" column
// holding the vector as an array[float]
Dataset<Row> vectorModel = pSpark.read().parquet("wordvectors.parquet");

// Convert the dataset to a map
Map<String, List<Float>> vectorMap = Arrays.stream((Row[]) vectorModel.collect())
    .collect(Collectors.toMap(row -> row.getAs("term"), row -> row.getList(1)));

// Convert to the format that the Word2Vec model expects: float[] rather than List<Float>
Map<String, float[]> word2vecMap = vectorMap.entrySet().stream()
    .collect(Collectors.toMap(Map.Entry::getKey, entry -> Floats.toArray(entry.getValue())));

// Convert to a Scala immutable map, because that is what Word2VecModel needs
scala.collection.immutable.Map<String, float[]> scalaMap = toScalaImmutableMap(word2vecMap);

private static <K, V> scala.collection.immutable.Map<K, V> toScalaImmutableMap(Map<K, V> pFromMap) {
    final List<Tuple2<K, V>> list = pFromMap.entrySet().stream()
        .map(e -> Tuple2.apply(e.getKey(), e.getValue()))
        .collect(Collectors.toList());
    Seq<Tuple2<K, V>> scalaSeq = JavaConverters.asScalaBufferConverter(list).asScala().toSeq();
    return (scala.collection.immutable.Map<K, V>) scala.collection.immutable.Map$.MODULE$.apply(scalaSeq);
}
Now you can construct the model from scratch. Due to a quirk in how Word2VecModel works, you must set the vector size manually, and do so in a somewhat odd way; otherwise it defaults to 100 and you get an error when trying to invoke .transform(). Here is a way I've found that works; I'm not sure whether every step is necessary:
// Not used for fitting, only used for setting the vector size param
// (not sure if this is needed or if result.set() alone is enough)
Word2Vec parent = new Word2Vec();
parent.setVectorSize(300);

Word2VecModel result =
        new Word2VecModel("w2vmodel", new org.apache.spark.mllib.feature.Word2VecModel(scalaMap))
                .setParent(parent);
result.set(result.vectorSize(), 300);
Now you should be able to use result.transform() like you would with a self-trained model.
I haven't tested other Word2VecModel functions to see whether they work correctly; I have only tested .transform().
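Since the question is tagged scala, here is a minimal Scala sketch of the same idea. It is only a sketch, assuming the same "wordvectors.parquet" layout as above (a "term" string column and a "vector" array<float> column), that spark is a SparkSession, and that the vocabulary fits in driver memory. It builds the lower-level org.apache.spark.mllib.feature.Word2VecModel, whose public constructor takes a plain Map[String, Array[Float]] and which exposes findSynonyms directly ("king" is just an illustrative query word):

import org.apache.spark.mllib.feature.Word2VecModel

// Collect the (term, vector) pairs onto the driver and build the map the model expects
val vectorMap: Map[String, Array[Float]] = spark.read.parquet("wordvectors.parquet")
  .collect()
  .map(row => row.getAs[String]("term") -> row.getAs[Seq[Float]]("vector").toArray)
  .toMap

val model = new Word2VecModel(vectorMap)

// findSynonyms returns (word, cosine similarity) pairs
model.findSynonyms("king", 10).foreach(println)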

Related

Read CSV based on header (first row in the CSV) and map the values [duplicate]

I'm dealing with many CSV files that don't have a fixed header/column set; for example, I can get file1.csv with 10 columns and file2.csv with 50 columns.
I can't know the number of columns in advance and I can't create a specific job for each file type; my input is a black box: a bunch of CSVs with anywhere from 10 columns upwards.
Since I want to use Spring Batch to auto-import these CSVs, I want to know whether that is possible. I know that I normally need a fixed number of columns because of the processor and the fact that I need to serialize my data into a POJO before sending it on to a writer.
Could my processor serialize an Array? Instead of sending one simple Object, can I get an Array of Objects, so that at the end of my job I'll have an Array of Arrays of Objects?
What do you think?
Thanks
I arrived at this old post with the very same question. I finally managed to build a dynamic-column FlatFileItemReader with the help of the skippedLinesCallback, so I leave it here:
@Bean
public FlatFileItemReader<Person> reader() {
    DefaultLineMapper<Person> lineMapper = new DefaultLineMapper<>();
    DelimitedLineTokenizer delimitedLineTokenizer = new DelimitedLineTokenizer();
    lineMapper.setLineTokenizer(delimitedLineTokenizer);
    lineMapper.setFieldSetMapper(new BeanWrapperFieldSetMapper<>() {
        {
            setTargetType(Person.class);
        }
    });
    return new FlatFileItemReaderBuilder<Person>()
            .name("personItemReader")
            .resource(new FileSystemResource(inputFile))
            .linesToSkip(1)
            .skippedLinesCallback(line -> delimitedLineTokenizer.setNames(line.split(",")))
            .lineMapper(lineMapper)
            .build();
}
In the callback method you update the names of the tokenizer from the header line. You could also add some validation logic here. With this solution there is no need to write your own LineTokenizer implementation.
Create your own LineTokenizer implementation. The DelimitedLineTokenizer expects a predefined number of columns. If you create your own, you can be as dynamic as you want. You can read more about the LineTokenizer in the documentation here: http://docs.spring.io/spring-batch/apidocs/org/springframework/batch/item/file/transform/LineTokenizer.html

Count Triangles in Scala - Spark

I am trying to get into data analytics using Spark with Scala. My question is: how do I get the triangles in a graph? I mean not the triangle count that comes with GraphX, but the actual nodes that make up each triangle.
Suppose we have a graph file. I was able to compute the triangles in plain Scala, but the same technique does not carry over to Spark, since I have to use RDD operations.
The data I give to the function is a List consisting of the src and the List of destinations of that source, e.g. Adj(5, List(1,2,3)), Adj(4, List(9,8,7)), ...
My Scala version is this:
(Paths: List[Adj])

Paths.flatMap(i => Paths.map(j => Paths.map(k => {
  if (i.src != j.src && i.src != k.src && j.src != k.src) {
    if (i.dst.contains(j.src) && j.dst.contains(k.src) && k.dst.contains(i.src)) {
      println(i.src, j.src, k.src) // 3 nodes that make a triangle
    } else {
      ()
    }
  }
})))
And the output would be something like:
(1,2,3)
(4,5,6)
(2,5,6)
In short, I want the same output, but produced in a Spark execution environment. In addition, I am looking for a more efficient way to hold the adjacency information, such as keying the data and then reducing by key. Since Spark requires a rather different way of approaching each problem (big-data operations), I would be grateful if you could explain the way of thinking and give a small briefing on the functions you use.
Thank you.
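For reference, here is a minimal RDD sketch of the same directed check, expressed with joins. It is only an illustrative sketch, assuming a case class Adj(src: Int, dst: List[Int]) and an RDD[Adj] named adjacencies (the names are made up for the example):

// Flatten the adjacency lists into a directed edge list (u, v)
val edges = adjacencies.flatMap(a => a.dst.map(d => (a.src, d))).distinct()

// Build length-2 paths u -> v -> w by joining on the middle vertex v
val twoPaths = edges.map { case (u, v) => (v, u) } // key by v
  .join(edges)                                     // (v, (u, w)) means u -> v and v -> w
  .map { case (v, (u, w)) => ((w, u), v) }         // key by the edge w -> u that would close the triangle

// Keep only the paths whose closing edge w -> u actually exists
val triangles = twoPaths
  .join(edges.map(e => (e, ())))                   // ((w, u), (v, ()))
  .map { case ((w, u), (v, _)) => (u, v, w) }
  .filter { case (u, v, w) => u != v && v != w && u != w }

triangles.collect().foreach(println)

Note that each directed 3-cycle shows up once per rotation (three times in total), so you may want to deduplicate, for example by keeping only the tuple whose first vertex is the smallest.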

pyspark unindex one-hot encoded and assembled columns

I have the following code which takes in a mix of categorical and numeric features, string-indexes the categorical features, one-hot encodes them, assembles both the one-hot encoded categorical features and the numeric features, runs them through a random forest, and prints the resulting tree. I want the tree nodes to show the original feature names (i.e. Frame_Size etc.). How can I do that? In general, how can I decode one-hot encoded and assembled features?
# categorical features : string indexing and one-hot encoding
column_vec_in = ['Commodity','Frame_Size' , 'Frame_Shape', 'Frame_Color','Frame_Color_Family','Lens_Color','Frame_Material','Frame_Material_Summary','Build', 'Gender_Global', 'Gender_LC'] # frame Article_Desc not selected because the cardinality is too high
column_vec_out = ['Commodity_catVec', 'Frame_Size_catVec', 'Frame_Shape_catVec', 'Frame_Color_catVec','Frame_Color_Family_catVec','Lens_Color_catVec','Frame_Material_catVec','Frame_Material_Summary_catVec','Build_catVec', 'Gender_Global_catVec', 'Gender_LC_catVec']
indexers = [StringIndexer(inputCol=x, outputCol=x+'_tmp') for x in column_vec_in ]
encoders = [OneHotEncoder(dropLast=False, inputCol=x+"_tmp", outputCol=y) for x,y in zip(column_vec_in, column_vec_out)]
tmp = [[i,j] for i,j in zip(indexers, encoders)]
tmp = [i for sublist in tmp for i in sublist]
#categorical and numeric features
cols_now = ['SODC_Regular_Rate','Commodity_catVec', 'Frame_Size_catVec', 'Frame_Shape_catVec', 'Frame_Color_catVec','Frame_Color_Family_catVec','Lens_Color_catVec','Frame_Material_catVec','Frame_Material_Summary_catVec','Build_catVec', 'Gender_Global_catVec', 'Gender_LC_catVec']
assembler_features = VectorAssembler(inputCols=cols_now, outputCol='features')
labelIndexer = StringIndexer(inputCol='Lens_Article_Description_reduced', outputCol="label")
tmp += [assembler_features, labelIndexer]
# converter = IndexToString(inputCol="featur", outputCol="originalCategory")
# converted = converter.transform(indexed)
pipeline = Pipeline(stages=tmp)
all_data = pipeline.fit(df_random_forest_P_limited).transform(df_random_forest_P_limited)
all_data.cache()
trainingData, testData = all_data.randomSplit([0.8,0.2], seed=0)
rf = RF(labelCol='label', featuresCol='features',numTrees=10,maxBins=800)
model = rf.fit(trainingData)
print(model.toDebugString)
After I run the Spark machine learning pipeline I want to print out the random forest as a tree. Currently the printout refers to the features only by index (feature 1, feature 2, etc.).
What I actually want to see is the original categorical feature names instead of feature 1, feature 2, etc. The fact that the categorical features are one-hot encoded and vector-assembled makes it hard for me to unindex/decode the feature names. How can I unindex/decode one-hot encoded and assembled feature vectors in PySpark? I have a vague idea that I have to use IndexToString(), but I am not exactly sure, because there is a mix of numeric and categorical features and they are one-hot encoded and assembled.
Export the Apache Spark ML pipeline to a PMML document using the JPMML-SparkML library. A PMML document can be inspected and interpreted by humans (e.g. using Notepad), or processed programmatically (e.g. using other Java PMML API libraries).
The "model schema" is represented by the /PMML/MiningModel/MiningSchema element. Each "active feature" is represented by a MiningField element; you can retrieve their "type definitions" by looking up the corresponding /PMML/DataDictionary/DataField element.
Edit: Since you were asking about PySpark, you might consider using the JPMML-SparkML-Package package for export.
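Another route worth mentioning, staying inside Spark itself: the "features" column produced by VectorAssembler carries ML attribute metadata that maps each vector index back to a name derived from the original input columns. A rough Scala sketch (the question uses PySpark, but the metadata is written by the same pipeline stages; allData here stands for the pipeline output called all_data above, and the exact attribute names depend on the Spark version):

import org.apache.spark.ml.attribute.AttributeGroup

// The "features" column written by VectorAssembler keeps per-index attribute metadata
val attrGroup = AttributeGroup.fromStructField(allData.schema("features"))

// Print index -> name; names look roughly like "Frame_Size_catVec_<label>" for encoded columns
attrGroup.attributes.foreach(_.foreach { attr =>
  println(s"${attr.index.getOrElse(-1)} -> ${attr.name.getOrElse("unnamed")}")
})

These index-to-name pairs can then be used to translate the "feature N" references in model.toDebugString back to the original columns.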

Spark/Scala read hadoop file

In a Pig script I saved a table using PigStorage('|').
In the corresponding Hadoop folder I have files like
part-r-00000
etc.
What is the best way to load it in Spark/Scala? In this table I have 3 fields: Int, String, Float.
I tried:
text = sc.hadoopFile("file", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], sc.defaultMinPartitions)
But then I would somehow need to split each line. Is there a better way to do it?
If I were coding in Python, I would create a DataFrame indexed by the first field, whose columns are the values found in the string field and whose coefficients are the float values. But I need to use Scala to use the PCA module, and Scala DataFrames don't seem that close to Python's.
Thanks for the insight
PigStorage creates a text file without schema information, so you need to do that work yourself, something like:
sc.textFile("file") // or directory where the part files are
val data = csv.map(line => {
vals=line.split("|")
(vals(0).toInt,vals(1),vals(2).toDouble)}
)
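Since the goal is ultimately a DataFrame for PCA, a possible alternative is to let Spark's CSV reader do the splitting with '|' as the delimiter. A sketch, assuming Spark 2.x, a SparkSession named spark, and illustrative column names (adjust the schema and path to your data):

import org.apache.spark.sql.types._

// Hypothetical column names matching the Int, String, Float fields
val schema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("label", StringType),
  StructField("value", FloatType)
))

val df = spark.read
  .option("sep", "|")
  .schema(schema)
  .csv("path/to/table") // reads all part-r-* files in the directory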

Spark: RDD.saveAsTextFile when using a pair of (K,Collection[V])

I have a dataset of employees and their leave records. Every record (of type EmployeeRecord) contains EmpID (of type String) and other fields. I read the records from a file and then transform them into PairRDDFunctions:
val empRecords = sc.textFile(args(0))
....
val empsGroupedByEmpID = this.groupRecordsByEmpID(empRecords)
At this point, 'empsGroupedByEmpID' is of type RDD[(String, Iterable[EmployeeRecord])]. I wrap this in PairRDDFunctions:
val empsAsPairRDD = new PairRDDFunctions[String,Iterable[EmployeeRecord]](empsGroupedByEmpID)
Then I process the records as per the logic of the application. Finally, I get an RDD of type RDD[Iterable[EmployeeRecord]]:
val finalRecords: RDD[Iterable[EmployeeRecord]] = <result of a few computations and transformation>
When I try to write the contents of this RDD to a text file using the available API thus:
finalRecords.saveAsTextFile("./path/to/save")
then I find that in the file every record begins with ArrayBuffer(...). What I need is a file with one EmployeeRecord per line. Is that not possible? Am I missing something?
I have spotted the missing API. It is well...flatMap! :-)
By using flatMap with identity, I can get rid of the Iterable wrapper and 'unpack' its contents, like so:
finalRecords.flatMap(identity).saveAsTextFile("./path/to/file")
That solves the problem I have been having.
I also found this post suggesting the same thing. I wish I had seen it a bit earlier.
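One additional note: saveAsTextFile writes each element's toString, so if EmployeeRecord's default representation is not the line format you want, you can map each record to a string before saving. A small sketch with made-up field names, assuming EmployeeRecord is a case class such as EmployeeRecord(empId: String, leaveDate: String, days: Int):

finalRecords
  .flatMap(identity)                                // unpack each Iterable[EmployeeRecord]
  .map(r => s"${r.empId},${r.leaveDate},${r.days}") // one CSV-style line per record
  .saveAsTextFile("./path/to/file")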