I have a dataframe (df) with 1 million rows and two columns: ID (long int) and description (String). After transforming them into TF-IDF (using Tokenizer, HashingTF, and IDF), df has two columns: ID and features (a sparse vector).
I computed the item-item similarity matrix using a udf and the dot function.
Computing the similarities completes successfully.
However, when I call the show() function I get
"raise EOFError"
I have read many questions on this issue but have not found the right answer yet.
Note that if I apply my solution to a small dataset (like 100 rows), everything works.
Is it related to an out-of-memory issue?
I checked my dataset and the description information, and I don't see any records with null or unsupported text.
```
import pyspark.sql.functions as psf

# dot_udf is the dot-product UDF mentioned above (its definition is not shown)
dist_mat = data.alias("i").join(data.alias("j"), psf.col("i.ID") < psf.col("j.ID")) \
    .select(psf.col("i.ID").alias("i"), psf.col("j.ID").alias("j"),
            dot_udf("i.features", "j.features").alias("score"))
dist_mat = dist_mat.filter(psf.col('score') > 0.05)
dist_mat.show(1)
```
If I remove the last line, dist_mat.show(), it runs without error. With that line, I get an error like:
.......
```Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded```
...
Here is the part of the error message:
```[Stage 6:=======================================================> (38 + 1) / 39]Traceback (most recent call last):
File "/usr/local/Cellar/apache-spark/2.4.0/libexec/python/lib/pyspark.zip/pyspark/daemon.py", line 170, in manager
File "/usr/local/Cellar/apache-spark/2.4.0/libexec/python/lib/pyspark.zip/pyspark/daemon.py", line 73, in worker
File "/usr/local/Cellar/apache-spark/2.4.0/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 397, in main
if read_int(infile) == SpecialLengths.END_OF_STREAM:
File "/usr/local/Cellar/apache-spark/2.4.0/libexec/python/lib/pyspark.zip/pyspark/serializers.py", line 714, in read_int
raise EOFError
EOFError```
I increased the cluster size and ran it again, and it worked without errors. So the error message was accurate:
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
However, for computing pairwise similarities on such a large-scale matrix, I found an alternative solution: Large scale matrix multiplication with pyspark.
In fact, it is very efficient and much faster, even better than using BlockMatrix.
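The size of the self-join above goes some way toward explaining why 100 rows work and 1 million do not. The following is a minimal arithmetic sketch, assuming every ID is distinct (the question does not state this explicitly); `pair_count` is a hypothetical helper, not part of PySpark.

```python
def pair_count(n_rows):
    """Number of (i, j) pairs with i.ID < j.ID among n_rows rows with distinct IDs."""
    return n_rows * (n_rows - 1) // 2


small = pair_count(100)         # pairs scored for the 100-row test
full = pair_count(1_000_000)    # pairs scored for the full dataframe

print(small)          # 4950
print(full)           # 499999500000
print(full // small)  # 101010000
```

Each of those pairs is passed through the Python UDF before the score filter can discard anything, so the full run does roughly 10^8 times the work of the 100-row test.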
I'm working in a Google Colab notebook, set up via:
!wget http://setup.johnsnowlabs.com/nlu/colab.sh -O - | bash
import nlu
A quick version check with nlu.version() confirms 3.4.2.
Several of the official tutorial notebooks (for example, XLNet) create a multi-model pipeline that includes both 'sentiment' and 'emotion'.
Direct copy of content from the notebook:
import pandas as pd
# Download the dataset
!wget -N https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/sarcasm/train-balanced-sarcasm.csv -P /tmp
# Load dataset to Pandas
df = pd.read_csv('/tmp/train-balanced-sarcasm.csv')
pipe = nlu.load('sentiment pos xlnet emotion')
df['text'] = df['comment']
max_rows = 200
predictions = pipe.predict(df.iloc[0:100][['comment','label']], output_level='token')
predictions
However, running a prediction on this pipe results in the following error:
sentimentdl_glove_imdb download started this may take some time.
Approximate size to download 8.7 MB
[OK!]
pos_anc download started this may take some time.
Approximate size to download 3.9 MB
[OK!]
xlnet_base_cased download started this may take some time.
Approximate size to download 417.5 MB
[OK!]
classifierdl_use_emotion download started this may take some time.
Approximate size to download 21.3 MB
[OK!]
glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]
sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]
---------------------------------------------------------------------------
IllegalArgumentException Traceback (most recent call last)
<ipython-input-1-9b2e4a06bf65> in <module>()
34
35 # NLU to gives us one row per embedded word by specifying the output level
---> 36 predictions = pipe.predict( df.iloc[0:5][['text','label']], output_level='token' )
37
38 display(predictions)
9 frames
/usr/local/lib/python3.7/dist-packages/pyspark/sql/utils.py in raise_from(e)
IllegalArgumentException: requirement failed: Wrong or missing inputCols annotators in SentimentDLModel_6c1a68f3f2c7.
Current inputCols: sentence_embeddings#glove_100d. Dataset's columns:
(column_name=text,is_nlp_annotator=false)
(column_name=document,is_nlp_annotator=true,type=document)
(column_name=sentence,is_nlp_annotator=true,type=document)
(column_name=sentence_embeddings#tfhub_use,is_nlp_annotator=true,type=sentence_embeddings).
Make sure such annotators exist in your pipeline, with the right output names and that they have following annotator types: sentence_embeddings
Having experimented with various combinations of models, it turns out that the problem is caused whenever 'sentiment' and 'emotion' models are specified in the same pipeline (regardless of pipeline order or what other models are listed).
Running pipe = nlu.load('emotion ANY OTHER MODELS') or pipe = nlu.load('sentiment ANY OTHER MODELS') succeeds, so it really appears to be only a result of combining 'sentiment' and 'emotion'.
Is this a known bug? Does anyone have any suggestions for fixing it?
My temporary solution has been to run emoPipe = nlu.load('emotion').predict() in isolation, then inner join the resulting dataframe to the resulting df of pipe = nlu.load('sentiment pos xlnet').predict().
However, I would like to understand better what is failing and to know if there is a way to streamline the inclusion of all models.
Thanks
I have a dataframe with about 2 million rows and 2 columns: id and url. I need to parse the domain from the url. I used a lambda with urlparse or a simple split, but I keep getting EOFError both ways. If I create a random sample of 400,000 rows, it works.
What is also interesting is that pyspark shows me the top 20 rows with the new column domain, but I cannot do anything else with it without getting the error again.
Is it a memory issue or is something wrong with the data? Can somebody please advise me or give me a hint?
I searched several questions regarding this; none of them helped me.
The code:
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

parse_domain = udf(lambda x: x.split("//")[-1].split("/")[0].split('?')[0],
                   returnType=StringType())
df = df.withColumn("domain", parse_domain(col("url")))
df.show()
Example urls:
"https://www.dataquest.io/blog/loading-data-into-postgres/"
"https://github.com/geekmoss/WrappyDatabase"
"https://www.google.cz/search?q=pyspark&rlz=1C1GCEA_enCZ786CZ786&oq=pyspark&aqs=chrome..69i64j69i60l3j35i39l2.3847j0j7&sourceid=chrome&ie=UTF-8"
"https://search.seznam.cz/?q=google"
And the error I keep getting:
Traceback (most recent call last):
File "/opt/spark-2.3.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/daemon.py", line 170, in manager
File "/opt/spark-2.3.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/daemon.py", line 73, in worker
File "/opt/spark-2.3.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 278, in main
if read_int(infile) == SpecialLengths.END_OF_STREAM:
File "/opt/spark-2.3.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 692, in read_int
raise EOFError
EOFError
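The EOFError above only reports that a Python worker's input stream ended; it does not name the underlying cause, which may be in the executor logs. One thing that can be ruled out cheaply is the parsing function itself, since the lambda would fail on a null url. Below is a minimal sketch of a null-safe parser using only the standard library. It assumes some rows might hold null, blank, or scheme-less URLs, which the question does not confirm, and this `parse_domain` is a hypothetical plain-Python replacement for the lambda.

```python
from urllib.parse import urlparse


def parse_domain(url):
    """Return the network location (host) of a URL, or None if it cannot be parsed."""
    if url is None or not url.strip():
        return None  # null or blank input: nothing to parse
    candidate = url.strip()
    if "//" not in candidate:
        # urlparse only fills netloc when the URL contains "//"
        candidate = "//" + candidate
    try:
        return urlparse(candidate).netloc or None
    except ValueError:
        # urlparse raises ValueError on some malformed inputs (e.g. bad IPv6 brackets)
        return None


print(parse_domain("https://www.google.cz/search?q=pyspark"))  # www.google.cz
print(parse_domain(None))                                      # None
```

Such a function can be wrapped with udf(parse_domain, StringType()) in place of the lambda. If the error persists with a null-safe function, memory on the executors is the more likely suspect.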
I'm currently implementing a program classifier for my coursework.
My lecturer asked me to use an "Evolving ANN" algorithm.
So I found a package called NEAT (Neuro Evolution of Augmenting Topologies).
I have 10 inputs and 7 outputs, so I just modified the example source from its documentation:
def eval_fitness(genomes):
    for g in genomes:
        net = nn.create_feed_forward_phenotype(g)
        mse = 0
        for inputs, expected in zip(alldata, label):
            output = net.serial_activate(inputs)
            output = np.clip(output, -1, 1)
            mse += (output - expected) ** 2
        g.fitness = 1 - (mse/44000)  # 44000 is the number of samples
        print(g.fitness)
I changed the config file too, so the program has 10 inputs and 7 outputs.
But when I try to run the code, it gives me this error:
Traceback (most recent call last):
File "/home/ilhammaziz/PycharmProjects/tuproSC2/eANN.py", line 40, in <module>
pop.run(eval_fitness, 10)
File "/home/ilhammaziz/.local/lib/python3.5/site-packages/neat/population.py", line 190, in run
best = max(population)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
What am I supposed to do?
Thanks
As far as I can tell the error is not in your code but in the library itself. Just use a different one.
This one looks promising to me.
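Before switching libraries, it may also be worth checking the fitness value itself. With 7 outputs, (output - expected) ** 2 is a 7-element array, so mse and g.fitness end up as arrays, and comparing array-valued fitnesses inside max(population) would likely raise exactly this ValueError. The sketch below reduces the error to a single float. `scalar_fitness` is a hypothetical helper, not part of NEAT, and it assumes each output and target is a sequence of numbers.

```python
def scalar_fitness(outputs, targets):
    """Return 1 - (summed squared error / number of samples) as a single float.

    outputs, targets: equally long sequences of per-sample vectors.
    """
    total = 0.0
    for output, expected in zip(outputs, targets):
        for o, e in zip(output, expected):
            clipped = max(-1.0, min(1.0, o))  # same clipping range as np.clip(output, -1, 1)
            total += (clipped - e) ** 2       # accumulate a scalar, not an array
    return 1.0 - total / len(outputs)


fitness = scalar_fitness([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
print(fitness)  # 1.0
```

In the original loop the equivalent change would be to sum the squared differences (for example with np.sum) before adding them to mse.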
I have a dictionary of PySpark RDDs and am trying to convert them to data frames, save them as variables, and then join them. When I attempt to convert one of my RDDs to a data frame I get the following error:
File "./spark-1.3.1/python/pyspark/sql/types.py",
line 986, in _verify_type
"length of fields (%d)" % (len(obj), len(dataType.fields)))
ValueError: Length of object (52) does not match with length of fields (7)
Does anyone know what exactly this means, or can anyone help me with a workaround?
I agree: we need to see more code. Obfuscated data is fine.
You seem to be using Spark SQL (sql types), mapped onto what? HDFS/text?
From the error it would appear that your schema is incorrect, which leads to the error when you create the DataFrame.
This was due to passing an incorrect RDD, sorry everyone. The RDD I was passing didn't fit the code I was using.
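For anyone hitting the same message: it says that a record holds 52 values while the schema declares 7 fields. Below is a minimal sketch of a check that could be run on a small sample of records before building the data frame. `rows_not_matching` is a hypothetical helper, not a PySpark function, and the sample data is made up for illustration.

```python
def rows_not_matching(rows, n_fields):
    """Yield (index, length) for every row whose length differs from n_fields."""
    for index, row in enumerate(rows):
        if len(row) != n_fields:
            yield index, len(row)


# Made-up sample: the schema expects 2 fields, the second row has 3.
sample = [("a", 1), ("b", 2, "extra"), ("c", 3)]
print(list(rows_not_matching(sample, 2)))  # [(1, 3)]
```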
I have about 1500 files on S3. Each file looks like this:
Format :
UserId \t ItemId:Score,ItemId:Score,ItemId:Score \n
UserId \t ItemId:Score,ItemId:Score,ItemId:Score \n
I read the file as:
import scala.io.Source
val FileRead = Source.fromFile("/home/home/testdataFile1").mkString
Here is an example of what I get:
1152 401368:1.006,401207:1.03
1184 401230:1.119,40049:1.11,40029:1.31
How do I compute the average and standard deviation of the variable 'Score'?
While it's not explicit in the question, Apache Spark is a good tool for doing this in a distributed way. I assume you have set up a Spark cluster. Read the files into an RDD:
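import org.apache.spark.rdd.RDD
val lines: RDD[String] = sc.textFile("s3n://bucket/dir/*")
Pick out the scores somehow. Each line holds several ItemId:Score pairs, so flatMap over the item list:
val scores: RDD[Double] = lines.flatMap(_.split("\\s+").last.split(",")).map(_.split(":").last.toDouble).cache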
.cache saves it in memory. This avoids re-reading the files all the time, but can use a lot of RAM. Remove it if you want to trade speed for RAM.
Calculate the metrics:
val count = scores.count
val mean = scores.sum / count
val devs = scores.map(score => (score - mean) * (score - mean))
val stddev = Math.sqrt(devs.sum / (count - 1))
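The same arithmetic can be checked on the two sample lines from the question. This is a minimal plain-Python sketch rather than Spark code. `parse_scores` and `mean_and_stddev` are hypothetical helper names, and the parser assumes the UserId and the item list are separated by whitespace, as in the example output.

```python
import math


def parse_scores(line):
    """Extract every Score from 'UserId<whitespace>ItemId:Score,ItemId:Score,...'."""
    parts = line.split()
    if len(parts) < 2:
        return []  # blank or malformed line: no scores
    return [float(item.split(":")[1]) for item in parts[1].split(",")]


def mean_and_stddev(lines):
    """Return (mean, sample standard deviation) over all scores in all lines."""
    scores = [s for line in lines for s in parse_scores(line)]
    n = len(scores)
    mean = sum(scores) / n
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)  # n - 1: sample variance
    return mean, math.sqrt(variance)


sample = [
    "1152 401368:1.006,401207:1.03",
    "1184 401230:1.119,40049:1.11,40029:1.31",
]
mean, stddev = mean_and_stddev(sample)
print(round(mean, 3))  # 1.115
```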
This question is not new, so maybe I can update the answers.
There are stddev functions (stddev, stddev_pop, and stddev_samp) in Spark SQL (import org.apache.spark.sql.functions) since Spark 1.6.0.
I use Apache Commons Math for this kind of thing (http://commons.apache.org/proper/commons-math/userguide/stat.html), albeit from Java. You can stream values through the SummaryStatistics class, so you aren't limited by the size of memory. Scala-to-Java interop should allow you to do this, but I haven't tried it. You should be able to iterate through the file line by line and stream the values through an instance of SummaryStatistics. How hard could it be in Scala?
Lookie here, someone is off and Scala-izing the whole thing: https://code.google.com/p/scalalab/wiki/ApacheCommonMathsLibraryInScalaLab
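The streaming idea can be sketched without any library. The class below uses Welford's one-pass update, so values are consumed one at a time and never stored. It is a hypothetical `RunningStats` class written for illustration, not the Commons Math implementation.

```python
import math


class RunningStats:
    """One-pass mean and sample standard deviation (Welford's update)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the current mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    def stddev(self):
        """Sample standard deviation; 0.0 when fewer than two values were added."""
        return math.sqrt(self._m2 / (self.n - 1)) if self.n > 1 else 0.0


stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.add(value)
print(stats.mean)  # 5.0
```

Feeding it the scores parsed from each line as the files are read would give the mean and standard deviation in a single pass.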
I don't think that storage space should be an issue, so I would try putting all of the values into an array of doubles, adding them up, and dividing by the number of elements in the array to get the mean. Then add up the squares of the differences between each value and the mean, divide that by the number of elements, and take the square root.