GENSIM: 'TypeError: doc2bow expects an array of unicode tokens on input, not a single string' when trying to create mapping for dictionary - ipython

My text looks as follows:
text=['paris', 'shares', 'concerns', 'ecb', 'language', 'eroding', 'status', 'currency', 'union',
'diluting', 'legal', 'obligation', 'most', 'countries', 'join', 'ultimately', 'however', 'welcomes',
'britain', 'support', 'more', 'integrated', 'eurozone', 'recognises', 'uk', 'euro', 'means',
'obliged', 'choose', 'between', 'euro', 'pound', 'comment', 'article', 'moved', 'debates',
'february', 'language', 'english', 'web']
from gensim.corpora.dictionary import Dictionary
dictionary=Dictionary(text)
The error I'm getting:
TypeError: doc2bow expects an array of unicode tokens on input, not a single string
I've tried converting my text into a list of words, and also converting it to unicode, both to no avail. I'm no Python expert; I'm just trying to analyse some text. My next step would be to check how often each token appears in the document called text. I'm using the IPython notebook.
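A likely cause, offered as a sketch rather than a confirmed diagnosis: gensim's Dictionary takes an iterable of documents, where each document is itself a list of tokens. Passing the flat list text makes every word look like a separate document that is a bare string, which is what the error message complains about. Wrapping the list, as in Dictionary([text]), should avoid that. The snippet below uses only the standard library so that it runs without gensim; the shortened token list and the names documents and token_counts are illustrative assumptions.

```python
from collections import Counter

# Shortened, illustrative version of the token list from the question.
text = ['paris', 'shares', 'concerns', 'language', 'euro', 'pound', 'euro', 'language']

# One document = one list of tokens; the corpus is a list of such documents.
documents = [text]

# With gensim installed, the corresponding calls would be:
#     dictionary = Dictionary(documents)
#     bow = dictionary.doc2bow(text)   # list of (token_id, count) pairs

# Standard-library way to see how often each token appears per document.
token_counts = [Counter(doc) for doc in documents]
print(token_counts[0].most_common(2))
```

Counter is used here only to show the counting step; doc2bow gives the same counts keyed by the dictionary's integer token ids.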

html2pdf not showing character correctly, encoding for ē

I'm struggling with some characters in a PDF I'm trying to create with html2pdf. The following code creates the PDF, but ē is shown as an e.
$html2pdf=new Html2Pdf();
$html2pdf->writeHTML('<h1>Fēnix</h1>');
$html2pdf->output();
When I get the name from my database, ē is shown as a ?.
$query=$mysqli->query('SELECT name FROM table WHERE id=1;');
$result=$query->fetch_assoc();
$html2pdf=new Html2Pdf();
$html2pdf->writeHTML('<h1>'.$result['name'].'</h1>');
$html2pdf->output();
This is the way I connect to my database:
$mysqli=new mysqli('host', 'user', 'pass', 'db');
I have also tried adding a charset:
$mysqli->set_charset('utf8');
Or initiating the class with parameters:
$html2pdf=new Html2Pdf('P', 'A4', 'nl');
$html2pdf=new Html2Pdf('P', 'A4', 'nl', true, 'UTF8');
Other characters that are giving issues are: Ś ą ł ś
Both server and database are UTF-8.
The solution is to apply a UTF-8 font to all elements.
* { font-family:freeserif; }

TYPO3 v9.5.0 - Bootstrap Package url error message

I run TYPO3 9.5.0 LTS with the Bootstrap Package theme. Everything seems to work: I defined the site configuration and get nice-looking URLs. But quite often I get error messages like this:
Core: Exception handler (WEB): Uncaught TYPO3 Exception: #1436717266: Invalid header value for header "Expire"". The value must be a string or an array of strings. | InvalidArgumentException thrown in file /is/www/typo3_src-9.5.0/typo3/sysext/core/Classes/Http/Message.php in line 208. Requested URL: domain/content-examples/media/audio
What causes this, and how can I prevent it?
Edit: Might be this part in TYPO3\CMS\Frontend\Controller\TypoScriptFrontendController::getHttpHeadersForTemporaryContent() on line 4244:
/**
* Returns HTTP headers for temporary content.
* These headers prevent search engines from caching temporary content and asks them to revisit this page again.
* Please ensure to also send a 503 HTTP Status code with these headers.
*/
protected function getHttpHeadersForTemporaryContent(): array
{
return [
'Retry-after' => '3600',
'Pragma' => 'no-cache',
'Cache-control' => 'no-cache',
'Expire' => 0,
];
}
... so I changed it to 'Expires' => 0
https://forge.typo3.org/issues/86651#change-388813
It seems there's a typo in "Expire" header, should be "Expires".
Try to change it in:
TYPO3\CMS\Frontend\Controller\TypoScriptFrontendController::getHttpHeadersForTemporaryContent()
while they're fixing this problem
UPD
TYPO3\CMS\Frontend\Controller\TypoScriptFrontendController, line 4244
'Expire' => 0,
change to
'Expires' => '0',
https://forge.typo3.org/issues/86658
And the correct header name should be 'Expires', afaik:
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Expires
I think changing the file
typo3_src-9.5.0/typo3/sysext/frontend/Classes/Controller/TypoScriptFrontendController.php
on line 4244 from
'Expire' => 0,
to
'Expire' => '0',
helps, because the exception asks for a string value and 0 is an integer. The issue is reported at https://forge.typo3.org/issues/86658 and will, I am sure, be fixed in the next update.

Mongodb '$where' query using javascript regex

I am trying to reproduce SQL's REPLACE function in MongoDB.
My collection 'log' has this data in the 'text' field:
[01]ABC0007[0d0a]BB BABLOXC[0d0a]067989 PLPXBNS[0d0a02]BBR OIC002 L5U0/P AMD KAP 041800 T1200AND 2+00[0d0a0b03]
All I'm trying to do is remove the '[..]' segments using a (JavaScript) regex and then use contains('PLPXBNSBBR'), so that the expression returns true as described in the MongoDB documentation.
The query below works and returns the matching rows:
db.log.find({"$where":"return this.message.replace(new RegExp('0d0a02'),'').contains('PLPXBNS[]BBR') "});
However, I need to remove the '[..]' segments and match on PLPXBNSBBR.
These are the queries I tried unsuccessfully:
db.log.find({"$where" : " return this.message.replace( new RegExp('\[.*?\]', 'g'), '' ).contains('PLPXBNSBBR') " });
db.log.find({"$where" : " return this.message.replace( new RegExp('/[(.*?)]', 'g'), '' ).contains('PLPXBNSBBR') " });
db.log.find({"$where" : " return this.message.replace( new RegExp('//[.*//]'), '' ).contains('PLPXBNSBBR') " });
db.log.find({"$where" : " return this.message.replace( new RegExp('[.*?]'), '' ).contains('PLPXBNSBBR') " });
From the earlier discussion it appears that if I can pass the pattern as /[[^]]+]/g, it should strip the [..] but it is not doing that and not returning the matching rows.
Okay, I was able to get my desired results by chaining replace calls.
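For comparison, here is a minimal Python sketch of the stripping step on its own, outside MongoDB. It shows a pattern with escaped brackets applied to the sample value from the question; the names BRACKETED and strip_bracketed are illustrative. One possible reason the attempts inside $where failed, stated as an assumption: a JavaScript string literal turns '\[' into '[', so the RegExp constructor receives an unescaped character class, and the backslashes would need doubling at each level of quoting.

```python
import re

# Sample value copied from the question.
message = ("[01]ABC0007[0d0a]BB BABLOXC[0d0a]067989 PLPXBNS[0d0a02]BBR OIC002 "
           "L5U0/P AMD KAP 041800 T1200AND 2+00[0d0a0b03]")

# \[ and \] match literal brackets; [^\]]* matches everything up to the next ']'.
BRACKETED = re.compile(r"\[[^\]]*\]")


def strip_bracketed(value):
    """Remove every [..] segment, including the brackets themselves."""
    return BRACKETED.sub("", value)


cleaned = strip_bracketed(message)
print("PLPXBNSBBR" in cleaned)
```

The raw string (r"...") keeps the backslashes intact on their way to the regex engine, which is the step that is easy to lose when the pattern is nested inside quoted strings.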

How to create data frames from rdd of word's list

I have gone through all the answers on Stack Overflow and elsewhere on the internet, but nothing works. I have this RDD of a list of words:
tweet_words=['tweet_text',
'RT',
'#ochocinco:',
'I',
'beat',
'them',
'all',
'for',
'10',
'straight',
'hours']
**What I have done so far:**
Df =sqlContext.createDataFrame(tweet_words,["tweet_text"])
and
tweet_words.toDF(['tweet_words'])
**ERROR**:
TypeError: Can not infer schema for type: <class 'str'>
Looking at the code above, you are trying to convert a plain list to a DataFrame. A good Stack Overflow answer on this is: https://stackoverflow.com/a/35009289/1100699.
That said, here's a working version of your code:
from pyspark.sql import Row
# Create RDD
tweet_wordsList = ['tweet_text', 'RT', '#ochocinco:', 'I', 'beat', 'them', 'all', 'for', '10', 'straight', 'hours']
tweet_wordsRDD = sc.parallelize(tweet_wordsList)
# Load each word and create row object
wordRDD = tweet_wordsRDD.map(lambda l: l.split(","))
tweetsRDD = wordRDD.map(lambda t: Row(tweets=t[0]))
# Infer schema (using reflection)
tweetsDF = tweetsRDD.toDF()
# show data
tweetsDF.show()
HTH!
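A shorter variant may also work, sketched here under the assumption that the schema error comes from each element being a bare string rather than a record: wrapping every word in a one-element tuple gives each row exactly one field. Only the pure-Python step is executed below; the Spark call is left as a comment because it needs an active context (sqlContext, as in the question).

```python
tweet_words = ['tweet_text', 'RT', '#ochocinco:', 'I', 'beat', 'them',
               'all', 'for', '10', 'straight', 'hours']

# Wrap each word in a one-element tuple so every record has one field.
rows = [(word,) for word in tweet_words]

# With an active context, the DataFrame could then be built like this:
#     tweets_df = sqlContext.createDataFrame(rows, ["tweet_text"])
#     tweets_df.show()
print(rows[:3])
```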

how to get a format CSV without brackets and u letters from python script

How do I remove the u prefix in my Python script's output? When I run the following script, it exports data from MongoDB to CSV, but u prefixes end up in the CSV file. I tried many solutions, but none of them worked. Can someone please help me?
A sample of the Python script:
import codecs
import csv
cursor = db.workflows.find( {}, {'_id': 1, 'stages.interview': 1, 'stages.hmNotification': 1, 'stages.hmStage': 1, 'stages.type':1, 'stages.isEditable':1, 'stages.order':1, 'stages.name.en':1, 'stages.stageId':1 })
with codecs.open('stages2.csv', 'w', encoding='utf-8') as outfile:
    fields = ['_id', 'stages.interview', 'stages.hmNotification', 'stages.hmStage', 'stages.type', 'stages.isEditable','stages.order', 'stages.name', 'stages.stageId']
    write = csv.DictWriter(outfile, fieldnames=fields)
    write.writeheader()
    for stages_record in cursor:
        stages_record_id = stages_record['_id']
        for stage_record in stages_record['stages']:
            x = {
                '_id': stages_record_id,
                'stages.interview': stage_record['interview'],
                'stages.hmNotification': stage_record['hmNotification'],
                'stages.hmStage': stage_record['hmStage'],
                'stages.type': stage_record.get('type'),
                'stages.isEditable': stage_record['isEditable'],
                'stages.order': stage_record['order'],
                'stages.name': stage_record['name'],
                'stages.stageId': stage_record['stageId']}
            write.writerow(x)
The CSV output should not contain the {u'en': u'...'} wrapper, i.e. neither the braces nor the u prefixes. How do I fix the Python script? Please help me.
A sample of the CSV:
id,stages.interview,stages.hmNotification,stages.hmStage,stages.type,stages.isEditable,stages.order,stages.name,stages.stageId
5318cbd9a377f52a6a0f671f,False,False,False,new,False,0,{u'en': u'New'},51d1a2f4c0d9887b214f3694
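The u'...' text is how Python 2 prints a unicode string inside a dict, and it reaches the CSV because the whole stages.name sub-document is written into one cell. A minimal sketch of one way around it is to write the 'en' value itself. It assumes the name field always has the shape {'en': ...}, uses a hypothetical sample record, is written for Python 3, and writes to an in-memory buffer instead of stages2.csv; only a few of the columns are shown.

```python
import csv
import io

# Hypothetical record shaped like the documents in the question.
stages_record = {
    '_id': '5318cbd9a377f52a6a0f671f',
    'stages': [{'name': {'en': 'New'}, 'order': 0,
                'stageId': '51d1a2f4c0d9887b214f3694'}],
}

fields = ['_id', 'stages.order', 'stages.name', 'stages.stageId']
outfile = io.StringIO()  # stands in for the real CSV file
writer = csv.DictWriter(outfile, fieldnames=fields)
writer.writeheader()
for stage in stages_record['stages']:
    writer.writerow({
        '_id': stages_record['_id'],
        'stages.order': stage['order'],
        # Write the English label itself, not the whole {'en': ...} dict.
        'stages.name': (stage.get('name') or {}).get('en', ''),
        'stages.stageId': stage['stageId'],
    })

csv_text = outfile.getvalue()
print(csv_text)
```

In the original script the equivalent change would be on the 'stages.name' line, replacing stage_record['name'] with the value stored under its 'en' key.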