fastText: How to process a corpus with fastText?

I'm new to fastText and NLP. I have a French corpus in a CSV file, structured as follows:
| value | sentence | pivot |
|-------|--------------------------------|----------|
| 1 | My first [sentence] | sentence |
| 0 | My second [word] in a sentence | word |
| .. | ... | ... |
I want to know how to tell fastText to process the pivot word between brackets ([pivot]) when building my model, or whether fastText has a built-in feature that already knows which word to process. I really want to understand the mechanics of fastText; the documentation I found is limited. Thanks.

You can extract the word vectors of the pivot column using fastText in this way:
!git clone https://github.com/facebookresearch/fastText.git
!pip install ./fastText

import pandas as pd
import fasttext
import fasttext.util

# download the pre-trained French vectors (cc.fr.300.bin)
fasttext.util.download_model('fr', if_exists='ignore')  # French
model = fasttext.load_model('cc.fr.300.bin')

dataset = pd.read_csv('path to csv file', sep='\t')
vectors = []
for data in dataset['pivot']:   # bracket access: .pivot is the DataFrame.pivot method
    vectors.append(model[data])
https://fasttext.cc/docs/en/crawl-vectors.html
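Note that fastText has no built-in handling of the brackets: to fastText they are just characters, so you look up the pivot word itself (the pivot column), as the snippet above does. As a quick sanity check on the result (a minimal sketch reusing model and vectors from above):

print(model.get_dimension())            # 300 for cc.fr.300.bin
print(len(vectors), vectors[0].shape)   # one 300-dimensional vector per pivot word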

Related

pyspark dataframe: add a new indicator column with random sampling

I have a Spark dataframe with the following schema:
StructType(List(StructField(email_address,StringType,true),StructField(subject_line,StringType,true)))
I want to randomly sample 50% of the population into control and test groups. Currently I am doing it the following way:
df_segment_ctl = df_segment.sample(False, 0.5, seed=0)
df_segment_tmt = df_segment.join(df_segment_ctl, ["email_address"], "leftanti")
But I am certain there must be a better way, such as adding a column like the following:
+--------------------+----------+---------+
|       email_address|segment_id|group    |
+--------------------+----------+---------+
|xxxxxxxxxx#gmail.com|       1.1|treatment|
|   xxxxxxx#gmail.com|       1.6|control  |
+--------------------+----------+---------+
Any help is appreciated. I am new to this world.
UPDATE:
I don't want to split the dataframe into two; I just want to add an indicator column.
UPDATE:
Is it possible to have multiple splits elegantly? Suppose instead of two groups I want a single control and two treatment groups:
+--------------------+----------+---------+
|       email_address|segment_id|group    |
+--------------------+----------+---------+
|xxxxxxxxxx#gmail.com|       1.1|treat_1  |
|   xxxxxxx#gmail.com|       1.6|control  |
|     xxxxx#gmail.com|       1.6|treat_2  |
+--------------------+----------+---------+
You can split the Spark dataframe using randomSplit like below:
df_segment_ctl, df_segment_tmt = df_segment.randomSplit(weights=[0.5,0.5], seed=0)
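If you would rather add an indicator column than produce two separate dataframes, one option is to assign the group from a seeded rand() value; the same idea extends to one control and two treatment groups by cutting the random value at several thresholds. This is a sketch of that alternative (not randomSplit), untested against your data:

from pyspark.sql import functions as F

# roughly 50/50 split into control and treatment
df_flagged = df_segment.withColumn(
    "group",
    F.when(F.rand(seed=0) < 0.5, "control").otherwise("treatment"))

# one control and two treatment groups: materialise the random value once,
# then cut it at 1/3 and 2/3
df_flagged3 = (df_segment
    .withColumn("_r", F.rand(seed=0))
    .withColumn("group",
        F.when(F.col("_r") < 1/3, "control")
         .when(F.col("_r") < 2/3, "treat_1")
         .otherwise("treat_2"))
    .drop("_r"))

Note that rand() gives approximately the requested proportions, not an exact 50/50 split.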

How to groupBy and perform data scaling over each and every group using MlLib Pyspark?

I have a dataset just like the example below and I am trying to group all rows for a given symbol and perform standard scaling within each group, so that at the end all my data is scaled by group. How can I do that with MLlib and PySpark? I could not find a single solution on the internet for it. Can anyone help here?
+------+------------------+------------------+------------------+------------------+
|symbol| open| high| low| close|
+------+------------------+------------------+------------------+------------------+
| AVT| 4.115| 4.115| 4.0736| 4.0736|
| ZEC| 365.6924715181936| 371.9164684545918| 364.8854025324053| 369.5950712239761|
| ETH| 647.220769018717| 654.6370842160561| 644.8942258095359| 652.1231757197687|
| XRP|0.3856343600456335|0.4042970302356221|0.3662228285447956|0.4016658006619401|
| XMR|304.97650674864144|304.98649644294267|299.96970554155274| 303.8663243145598|
| LTC|321.32437862304715| 335.1872636382617| 320.9704201234651| 334.5057757774086|
| EOS| 5.1171| 5.1548| 5.1075| 5.116|
| BCH| 1526.839255299505| 1588.106037653013|1526.8392543926366|1554.8447136830328|
| DASH| 878.00000003| 884.03769206| 869.22000004| 869.22000004|
| BTC|17042.224796462127| 17278.87984139109|16898.509289685637|17134.611038665582|
| REP| 32.50162799| 32.501628| 32.41062673| 32.50162799|
| DASH| 858.98413357| 863.01413927| 851.07145059| 851.17051529|
| ETH| 633.1390884474979| 650.546984589714| 631.2674221381849| 641.4566047907362|
| XRP|0.3912300406160967|0.3915937383961073|0.3480682353334925|0.3488616679337076|
| EOS| 5.11| 5.1675| 5.0995| 5.1674|
| BCH|1574.9602789966184|1588.6004569127992| 1515.3| 1521.0|
| BTC| 17238.0199449088| 17324.83886467445|16968.291408828714| 16971.12960974206|
| LTC| 303.3999614441217| 317.6966006615225|302.40702519057584| 310.971265429805|
| REP| 32.50162798| 32.50162798| 32.345677| 32.345677|
| XMR| 304.1618444641083| 306.2720324372592|295.38042671416935| 295.520097663825|
+------+------------------+------------------+------------------+------------------+
I suggest you import the following:
import pyspark.sql.functions as f
then you can compute the per-group statistics like this:
stats_df = df.groupBy('symbol').agg(
    f.mean('open').alias('open_mean'),
    f.stddev('open').alias('open_stddev'))
This is the principle of how you would do it (you could instead use the min and max functions for min-max scaling); then you just have to apply the standard scaling formula using stats_df:
x' = (x - μ) / σ
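Putting it together, here is a minimal sketch (assuming the column names shown in the question) that joins the per-symbol statistics back onto the original frame and applies the formula to every price column:

import pyspark.sql.functions as f

cols = ['open', 'high', 'low', 'close']

# per-symbol mean and standard deviation for every column
stats_df = df.groupBy('symbol').agg(
    *[f.mean(c).alias(c + '_mean') for c in cols],
    *[f.stddev(c).alias(c + '_stddev') for c in cols])

# join the statistics back and apply x' = (x - mean) / stddev within each group
scaled = df.join(stats_df, on='symbol')
for c in cols:
    scaled = scaled.withColumn(c, (f.col(c) - f.col(c + '_mean')) / f.col(c + '_stddev'))
scaled = scaled.select('symbol', *cols)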

How to decode HTML entities in Spark-scala?

I have Spark code that reads some data from a database.
One of the columns (of type string), named "title", contains the following data:
+-----------------------------------+
|title                              |
+-----------------------------------+
|Example sentence                   |
|Read the &lsquo;Book&rsquo;        |
|&lsquo;LOTR&rsquo; Is A Great Book |
+-----------------------------------+
I'd like to remove the HTML entities and decode it to look as given below.
+-------------------------------------------+
|title |
+-------------------------------------------+
|Example sentence |
|Read the ‘Book’ |
|‘LOTR’ Is A Great Book |
+-------------------------------------------+
There is a library "html-entities" for Node.js that does exactly what I am looking for,
but I am unable to find something similar for Spark/Scala.
What would be a good approach to do this?
You can use org.apache.commons.lang.StringEscapeUtils with the help of a UDF to achieve this.
import org.apache.commons.lang.StringEscapeUtils
import org.apache.spark.sql.functions.udf
import spark.implicits._  // for the $"title" column syntax

// unescapeHtml turns entities such as &lsquo; back into their characters
val decodeHtml = (html: String) => StringEscapeUtils.unescapeHtml(html)
val decodeHtmlUDF = udf(decodeHtml)

df.withColumn("title", decodeHtmlUDF($"title")).show()
/*
+--------------------+
| title|
+--------------------+
| Example sentence |
| Read the ‘Book’ |
|‘LOTR’ Is A Great...|
+--------------------+
*/

Stata: string variable to quarterly time series

I have a data set with observations of this type:
"2015_1"
"2015_2"
"2015_3"
I want to convert them to a quarterly time series, like:
2015q1
2015q2
2015q3
This is a standard conversion task. See help datetime and help datetime display formats for the details.
* Example generated by -dataex-. To install: ssc install dataex
clear
input str6 have
"2015_1"
"2015_2"
"2015_3"
end
gen wanted = quarterly(have, "YQ")
format wanted %tq
list
+-----------------+
| have wanted |
|-----------------|
1. | 2015_1 2015q1 |
2. | 2015_2 2015q2 |
3. | 2015_3 2015q3 |
+-----------------+
describe
Contains data
obs: 3
vars: 2
---------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
---------------------------------------------------------------------------------------------
have str6 %6s
wanted float %tq
-------------------------------------------------------------------------------------

How to filter a column in a data frame by the regex value of another column in same data frame in Pyspark

I am trying to filter a column in a data frame by the regex pattern given in another column:
df = sqlContext.createDataFrame([('what is the movie that features Tom Cruise','actor_movies','(movie|film).*(feature)|(in|on).*(movie|film)'),
('what is the movie that features Tom Cruise','artist_song','(who|what).*(sing|sang|perform)'),
('who is the singer for hotel califonia?','artist_song','(who|what).*(sing|sang|perform)')],
['query','question_type','regex_patt'])
+--------------------------------------+-------------+---------------------------------------------+
|query                                 |question_type|regex_patt                                   |
+--------------------------------------+-------------+---------------------------------------------+
|what movie features Tom Cruise        |actor_movies |(movie|film).*(feature)|(in|on).*(movie|film)|
|what movie features Tom Cruise        |artist_song  |(who|what).*(sing|sang|perform)              |
|who is the singer for hotel califonia |artist_song  |(who|what).*(sing|sang|perform)              |
+--------------------------------------+-------------+---------------------------------------------+
I want to prune the data frame so that it only keeps the rows whose query matches the regex_patt column value.
The final result should look like this:
+--------------------------------------+-------------+---------------------------------------------+
|query                                 |question_type|regex_patt                                   |
+--------------------------------------+-------------+---------------------------------------------+
|what movie features Tom Cruise        |actor_movies |(movie|film).*(feature)|(in|on).*(movie|film)|
|who is the singer for hotel califonia |artist_song  |(who|what).*(sing|sang|perform)              |
+--------------------------------------+-------------+---------------------------------------------+
I was thinking of
df.filter(column('query').rlike('regex_patt'))
But rlike only accepts regex strings.
Now the question is: how do I filter the "query" column based on the regex value in the "regex_patt" column?
You could try this. The expression form lets you pass columns as both the string and the pattern:
from pyspark.sql import functions as F
df.withColumn("query1", F.expr("regexp_extract(query, regex_patt)")) \
  .filter(F.col("query1") != '') \
  .drop("query1") \
  .show(truncate=False)
+------------------------------------------+-------------+---------------------------------------------+
|query |question_type|regex_patt |
+------------------------------------------+-------------+---------------------------------------------+
|what is the movie that features Tom Cruise|actor_movies |(movie|film).*(feature)|(in|on).*(movie|film)|
|who is the singer for hotel califonia? |artist_song |(who|what).*(sing|sang|perform) |
+------------------------------------------+-------------+---------------------------------------------+
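Alternatively (an untested sketch), rlike can be written as a SQL expression, in which case the pattern can come from a column and no helper column is needed:

from pyspark.sql import functions as F

# per-row regex match: the query column is tested against the regex_patt column
df.filter(F.expr("query rlike regex_patt")).show(truncate=False)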