I'm trying to vectorise my data using Spark TF-IDF-like functions. As an output I get very long sparse feature vectors where only a few indexes have values.
I was thinking of saving such vectors into MongoDB as array-like objects, with the present indexes as keys. So, for example, a SparseVector like
(23,[0,15],[1.0,1.0])
would be converted into MongoDB object as follows:
{"0": 1.0, "15": 1.0}
How can I do this using Spark Scala and the MongoDB connector?
I should probably implement some kind of UDF, but I'm not sure what type MongoDB would accept as an input.
OK, I found the solution.
Here's the UDF I defined to convert SparseVectors to a BSON-convertible Map.
// Assuming spark.ml vectors; use org.apache.spark.mllib.linalg.SparseVector if you are on the RDD-based API.
import org.apache.spark.ml.linalg.SparseVector
import org.apache.spark.sql.functions.udf

// Map each active index of the SparseVector to an ("index" -> value) entry.
val makeSparseMapUdf = udf {
  (vec: SparseVector) => vec.indices
    .map(index => (index.toString, vec(index)))
    .toMap
}
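For completeness, here is a hedged usage sketch (not from the original answer) showing how the converted column could be written back with the MongoDB Spark connector. The DataFrame featuresDf, the column names, and the assumption that the output URI/database/collection are already set in the Spark config are all mine.

import com.mongodb.spark.MongoSpark
import org.apache.spark.sql.functions.col

// featuresDf, "features" and "featureMap" are hypothetical names.
val withMap = featuresDf.withColumn("featureMap", makeSparseMapUdf(col("features")))

// The Map[String, Double] column is written as an embedded document, e.g. {"0": 1.0, "15": 1.0}.
MongoSpark.save(withMap.drop("features"))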
You can use df.with_columns(pl.col('A').set_sorted()) to tell Polars that column A is sorted.
I assume that internally there is some metadata associated with this column. Is there a way to read it?
I see that Polars algorithms are much faster if DataFrames are sorted, and sometimes I want to be sure that I am taking these fast paths.
Proposal: would it be possible to have a LazyFrame/DataFrame attribute like metadata that would store this type of information?
Something like this:
df.metadata
{'A' : {'is_sorted' : True}}
This information is stored on the Series.
>>> s = pl.Series([1, 3, 2]).sort()
>>> s.flags
{'SORTED_ASC': True, 'SORTED_DESC': False}
I am trying to read a huge, complex document from MongoDB into a Spark DataFrame. When I convert the DB to JSON first, it works. But if I read directly from MongoDB, I get the following error: Caused by: com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast STRING into a DoubleType (value: BsonString{value='NaN'})
I am able to read it into a DF and do all the processing; I get the error only when I try to show it or write it to JSON/CSV.
at mongo$.main(mongo.scala:270) – df.show()
Using sbt for dependencies:
mongo-spark-connector: 2.2.1
Scala version: 2.11.6
Spark version: 2.3.0/2.2.0
As the error describes, this happens because there is a String value of "NaN" in a field whose Spark schema was inferred as DoubleType.
In other words, amongst all the documents there is a value for that field that is not a Double, for example:
{_id:1, foo: 100.00}
{_id:2, foo: 101.00}
{_id:3, foo: 102.00}
{_id:4, foo: 103.00}
...
{_id:99, foo: "NaN"}
{_id:100, foo: 200.00}
As you may know, "NaN" means "Not a Number". It is likely that whichever process created the document failed to insert a Double and defaulted to "NaN".
There are a few ways to solve this, depending on your use case:
Utilise MongoDB Schema Validation to ensure that the values within the collection have the expected type on insert.
Perform a transformation to clean the data. Query the collection to find the offending field, i.e. {foo: "NaN"}, and update it with a desired value, e.g. 0 (a Spark-side sketch follows below).
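A Spark-side variant of the second option (not from the original answer): assuming you have managed to load foo as a string column (for example via an explicit read schema), you can clean it after loading. This is only a sketch of the cleanup step, with df and foo taken from the example above.

import org.apache.spark.sql.functions.{col, lit, when}
import org.apache.spark.sql.types.DoubleType

// Replace the literal string "NaN" with null, then cast the cleaned column to Double.
val cleaned = df.withColumn("foo", when(col("foo") === "NaN", lit(null)).otherwise(col("foo")).cast(DoubleType))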
I had a similar conversion problem. The MongoDB connector takes a sample of 1,000 documents to infer the schema. In my case, 1,000 documents were not sufficient to cover all cases. I increased the sample size and this solved the problem. See the Mongo documentation.
Code:
import com.mongodb.spark._
import com.mongodb.spark.config.ReadConfig

// Raise the schema-inference sample size from the default of 1000 documents.
val readConfig = ReadConfig(Map(
  "database" -> "myDatabase",
  "collection" -> "myCollection",
  "sampleSize" -> "100000"), Some(ReadConfig(sc)))
val df = sc.loadFromMongoDB(readConfig).toDF()
How to sort Array[Row] by given column index in Scala?
I'm using RDD[Row].collect(), which gives me an Array[Row], but I want to sort it based on a given column index.
I have already implemented quick-sort logic and it works, but it involves too many for loops.
I would like to use a Scala built-in API which can do this task with the minimum amount of code.
It would be much more efficient to sort the DataFrame before collecting it; once you collect it, you lose the distributed (and parallel) computation. You can use the DataFrame's sort, for example in ascending order by column "col1":
import org.apache.spark.sql.functions.asc
val sorted = dataframe.sort(asc("col1"))
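If you really do need to sort the already-collected Array[Row] locally, Scala's built-in sortBy is enough. A minimal sketch (the helper name is mine, and the getter must match the column's actual type):

import org.apache.spark.sql.Row

// Hypothetical helper: sort collected rows by the value at colIndex (assumed here to be an Int column).
def sortRowsByColumn(rows: Array[Row], colIndex: Int): Array[Row] =
  rows.sortBy(_.getInt(colIndex)) // use getDouble/getString/... for other column types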
I am running a Spark job, and at some point I want to connect to an Elasticsearch server to get some data and add it to an RDD. The code I am using looks like this:
input.mapPartitions(records => {
  val elcon = new ElasticSearchConnection
  val client: TransportClient = elcon.openConnection()
  val newRecs = records.flatMap(record => {
    val response = client.prepareGet("index", "indexType",
      record.id.toString).execute().actionGet()
    val newRec = processRec(record, response)
    newRec
  }) // end of flatMap
  client.close()
  newRecs
}) // end of mapPartitions
My problem is that client.close() is called before the flatMap operation has finished, which of course results in an Exception. The code works if I move the creation and closing of the connection inside the flatMap, but that would create a huge number of connections. Is it possible to make sure that client.close() is called only after the flatMap operation has finished?
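As an aside (not part of the answer below): because flatMap on a partition's Iterator is lazy, one common workaround is to force evaluation before closing the client. A hedged sketch of that idea, reusing the hypothetical helpers from the question; note that it materialises the whole partition in memory, which may or may not be acceptable:

input.mapPartitions(records => {
  val elcon = new ElasticSearchConnection
  val client: TransportClient = elcon.openConnection()
  // toList forces the lazy iterator, so every request runs before close() is reached.
  val newRecs = records.flatMap(record => {
    val response = client.prepareGet("index", "indexType", record.id.toString).execute().actionGet()
    processRec(record, response)
  }).toList
  client.close()
  newRecs.iterator
})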
Making a blocking call for each item in your RDD to fetch the corresponding Elasticsearch document is what is causing the problem. It is generally advised to avoid blocking calls.
There is an alternative approach using Elasticsearch for Hadoop's Spark support:
Read the Elasticsearch index/type as another RDD and join it with your RDD.
Include the right version of the ES-Hadoop dependency.
import org.elasticsearch.spark._
val esRdd = sc.esRDD("index/indexType") // Returns a pair RDD of (_id, Map of all key-value pairs for all fields)
val inputPairs = input.map(record => (record.id, record)) // Convert your RDD of records to a pair RDD of (id, record), since we join on the id
inputPairs.join(esRdd).map(rec => processResponse(rec._2._1, rec._2._2)) // Join the two RDDs on the id: the result is a pair RDD of (id, (inputRecord, esRecord))
Hope this helps.
PS: This still does not alleviate the lack of co-location (i.e. each document with a given _id may come from a different shard of the index). A better approach would have been to achieve co-location of the input RDD and the ES index's documents when creating the ES index.
If I have an RDD of tuples with 5 elements, e.g.,
RDD[(Double, String, Int, Double, Double)]
How can I sort this RDD efficiently using the fifth element?
I tried mapping this RDD into key-value pairs and using sortByKey, but sortByKey seems quite slow; it is slower than collecting the RDD and using sortWith on the collected array. Why is that?
Thank you very much.
You can do this with sortBy acting directly on the RDD:
myRdd.sortBy(_._5) // Sort by 5th field of each 5-tuple
There are extra optional parameters to define the sort order (ascending) and the number of partitions, as shown in the sketch below.
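For example, a minimal sketch with both optional parameters spelled out (the partition count here is just an illustrative value):

myRdd.sortBy(_._5, ascending = false, numPartitions = 8) // Descending sort by the 5th field, into 8 partitions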
If you want to sort in descending order and the corresponding element is of type Int, you can use the "-" sign to sort the RDD in descending order.
For example:
If I have an RDD of (String, Int) tuples, to sort it by its 2nd element in descending order:
rdd.sortBy(x => -x._2).collect().foreach(println)
If I have an RDD of (String, String) tuples, to sort it by its 2nd element in descending order:
rdd.sortBy(x => x._2, false).collect().foreach(println)
sortByKey is the only distributed sorting API for Spark 1.0.
How much data are you trying to sort? A small amount will be faster with local/centralized sorting. If you try to sort GBs and GBs of data that may not even fit on a single node, that's where Spark shines.
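For reference, a minimal sketch of the sortByKey route for the 5-tuple RDD from the question (names assumed): key by the fifth element, sort in a distributed fashion, then drop the duplicated key.

// rdd: RDD[(Double, String, Int, Double, Double)]
// On very old Spark versions you may also need: import org.apache.spark.SparkContext._
val sorted = rdd.keyBy(_._5).sortByKey().values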