corpus() in quanteda fails with an error mentioning kwic objects

I'm working on a big data project that consists of analyzing press URLs to detect the most popular topics. My topic is football (the Mbappe contract), and I collected 180 URLs from Marca, a Spanish media outlet, in a .txt file.
When I try to create a document-feature matrix, the corpus() function from the quanteda package gives me this: Error: corpus() only works on character, corpus, Corpus, data.frame, kwic objects.
I think some URLs contain a kwic object (maybe a video, adverts...) that prevents me from working with just the text, and that these objects get picked up automatically when I select the article body div in the HTML.
Here is the code I use to read the pages:
library(rvest)    # read_html(), html_nodes(), html_text(); also provides %>%
library(stringr)  # str_replace_all()
url_marca <- read.table("mbappe.txt", stringsAsFactors = FALSE)$V1
get_marca_text <- function(url) {
  url %>%
    read_html() %>%
    html_nodes("div.ue-c-article__body") %>%
    html_text() %>%
    str_replace_all("[\r\n]", "")
}
text_marca_mbappe <- sapply(url_marca, get_marca_text)
Does anyone know whether this is caused by a mistake in html_nodes when inspecting the URL, or is it something different?


Using a special character for figure legend

I need to correctly spell an Indigenous name on a figure I am developing in R.
To start, the Geography was "Nisga'a Lands". Ultimately I want it to read "Nisg̱̱a'a Lands". So, the g becomes a g with a dash below (g̱̱).
I tried simply copying and pasting this and mutating the data frame, as well as playing with the encoding:
all_income_data = all_income_data %>%
mutate(Geography = stri_enc_toutf8(Geography)) %>%
mutate(Geography = ifelse(Geography == "Nisga'a Lands", "Nisg̱̱a'a Lands", Geography))
I unfortunately was only able to produce this result:
Is it possible to get the g the way I need it? Thanks so much in advance for any help

Issue with biofam3c data in seqHMM

I have the biofam3c data from the seqHMM package, which is in list format, and I want to see it in matrix form. I used this command:
matrix_biofam3c= matrix(unlist(biofam3c), ncol=16, byrow= TRUE)
But the result was confusing. I then tried to convert matrix_biofam3c back into a list with:
U= as.list(
This does not give the same structure as the one in the package (a list of length 4). Please suggest the right way to turn biofam3c into matrix data (like biofam in the TraMineR package), and how to convert it back into a list like the original.
The first three elements of the biofam3c list are matrices that contain sequence data for three channels, namely children, married, and left (for left home); the fourth element holds covariates. The children and left tables have binary data (children vs childless, left home vs with parents), while there are three states for married: single, married, and divorced.
The help page of biofam3c gives the code used to generate these three channels from the biofam data that ships with TraMineR. Note that this code assumes that we do not have divorced people living with their parents. The biofam data does not distinguish between divorced with and without children, nor between divorced who have left home and those who did not. See also the example in the help page of the seqdistmc function of TraMineR.
You can reconstruct biofam from the three channels using interaction():
library(TraMineR)  # seqstatl(), seqdef(), biofam
library(seqHMM)    # biofam3c
data("biofam3c")
## the 3 channels
children <- biofam3c[[1]]
married <- biofam3c[[2]]
left <- biofam3c[[3]]
## Reconstructing biofam
bf.rec <- matrix(NA, nrow(children), ncol(children))
for (i in 1:ncol(children)) {
  bf.rec[,i] <- as.matrix(interaction(children[,i], married[,i], left[,i]))
}
alph <- seqstatl(bf.rec)
## reordering states as in biofam
alphabet <- alph[c(6,5,4,3,10,9,8,7,2,1)]
## creating a single state divorced as in biofam
bf.rec[bf.rec %in% alphabet[8:10]] <- "divorced"
alphabet[8] <- "divorced"
alphabet <- alphabet[1:8]
## resulting sequence object (the name bf.rec.seq is chosen here for illustration)
bf.rec.seq <- seqdef(bf.rec, alphabet=alphabet)
You get the same figure using the original biofam
## in biofam sequence data are in columns 10 to 25
bf <- biofam[,10:25]
bf.seq <- seqdef(bf)  # the name bf.seq is chosen here for illustration

Mozilla DeepSpeech STT suddenly can't spell

I am using DeepSpeech for speech to text. Up to 0.8.1, when I ran transcriptions like:
byte_encoding = subprocess.check_output(
"deepspeech --model deepspeech-0.8.1-models.pbmm --scorer deepspeech-0.8.1-models.scorer --audio audio/2830-3980-0043.wav", shell=True)
transcription = byte_encoding.decode("utf-8").rstrip("\n")
I would get back results that were pretty good. But since 0.8.2, where the scorer argument was removed, my results are rife with misspellings, which makes me think I am now getting a character-level model where I used to get a word-level model. The errors look as if the model isn't correctly specified somehow.
Now when I call:
byte_encoding = subprocess.check_output(
['deepspeech', '--model', 'deepspeech-0.8.2-models.pbmm', '--audio', myfile])
transcription = byte_encoding.decode("utf-8").rstrip("\n")
I see errors like:
endless -> "endules"
service -> "servic"
legacy -> "legaci"
earning -> "erting"
before -> "befir"
I'm not 100% sure that it is related to removing the scorer from the API, but it is the one thing I see changing between releases, and the documentation suggested accuracy improvements in particular.
Short: The scorer matches letter output from the audio to actual words. You shouldn't leave it out.
Long: The scorer matches the output of the acoustic model to the words and word combinations present in the textual language model that is part of the scorer. If you leave out the scorer argument, nothing constrains the output to real-world words and sentences. Also bear in mind that each scorer has specific lm_alpha and lm_beta values that make the search even more accurate.
The 0.8.2 version should be able to take the scorer argument. Otherwise update to 0.9.0, which has it as well. Maybe something in your environment has changed; I would start again in a new directory and a fresh venv.
Assuming you are using Python, you could add this to your code:
ds.setScorerAlphaBeta(args.lm_alpha, args.lm_beta)
And check the example script.
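To pass the scorer from a subprocess call like the one in the question, something along these lines might work. This is a minimal sketch: build_deepspeech_command and transcribe are hypothetical helper names, and it assumes the deepspeech CLI on your PATH still accepts --scorer, as stated above.

```python
import subprocess
from typing import List, Optional


def build_deepspeech_command(model: str, audio: str,
                             scorer: Optional[str] = None) -> List[str]:
    """Build the argument list for the deepspeech CLI (hypothetical helper)."""
    command = ["deepspeech", "--model", model, "--audio", audio]
    if scorer is not None:
        # Without --scorer, decoding relies on the acoustic model alone.
        command += ["--scorer", scorer]
    return command


def transcribe(model: str, audio: str, scorer: Optional[str] = None) -> str:
    """Run the CLI and return its standard output as a stripped string."""
    output = subprocess.check_output(
        build_deepspeech_command(model, audio, scorer))
    return output.decode("utf-8").rstrip("\n")
```

Calling transcribe("deepspeech-0.8.2-models.pbmm", myfile, scorer="deepspeech-0.8.2-models.scorer") would then run the same command as in the question with the scorer added; the scorer file name is an assumption patterned on the 0.8.1 file name, so use whatever scorer file you actually downloaded.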

How to read a CSV file in Scala

I have a CSV file that I want to read and store in a case class. As far as I know, a CSV is a comma-separated values file. But in my CSV file some of the values already contain commas themselves, so splitting on commas creates a new column for every comma. How can I split the data correctly?
1st data
04/20/2021 16:20(1st column) Here a bunch of basic techniques that suit most businesses, and easy-to-follow steps that can help you create a strategy for your social media marketing goals.(2nd column)
2nd data
11-07-2021 12:15(1st column) Focus on attracting real followers who are genuinely interested in your content, and make the most of your social media marketing efforts.(2nd column)
import scala.io.Source
var i = 0
var length = 0
val data = Source.fromFile(file)
for (line <- data.getLines()) {
  val cols = line.split(",").map(_.trim)
  length = cols.length
}
If you are reading a complex CSV file then the ideal solution is to use an existing library. Here is a link to the ScalaDex search results for CSV.
ScalaDex CSV Search
However, based on the comments, it appears that you might actually be wanting to read data stored in a Google Sheet. If that is the case, you can utilize the fact that you have some flexibility to save the data in a text file yourself. When I want to read data from a Google Sheet in Scala, the approach I use first is to save the file in a format that isn't hard to read. If the fields have embedded commas but no tabs, which is common, then I will save the file as a TSV and parse that with split("\t").
A simple bit of code that only uses the standard library might look like the following:
val source = scala.io.Source.fromFile("data.tsv")
val data = source.getLines().map(_.split("\t")).toArray
source.close()
After this, data will be an Array[Array[String]] with your data in it that you can process as you desire.
Of course, if your data includes both tabs and commas then you'll really want to use one of those more robust external libraries.
You could use the univocity CSV parser for faster parsing.
You can also use it to write CSV files.
Univocity parsers

Importing a website table into Google Sheets

I have tried searching everywhere online for a good answer but cannot seem to find anything that matches specifically what I am looking for.
When I use the IMPORTHTML function in Google Sheets, I end up with data that looks like:
${} (${player.position}, ${team.abbrev}) ${opponent.abbrev} #${opponent_rank} ${minutes} ${pts} ${fgm}-${fga} ${ftm}-${fta} ${p3m}-${p3a} ${treb} ${ast} ${stl} ${blk} ${tov} ${pf} ${fp} $${salary} ${ratio}
The code that I am using looks like this:
=IMPORTHTML("", "table",2)
When I use the same as above (=IMPORTHTML("", "table",2)) only with "0" as my index, it pulls this:
Opp Stats
Player Team Rank Min Pts FGM/A FTM/A 3PM/A Reb Ast Stl Blk Tov Foul FP Cost Value
Basically, I am attempting to pull the table data from this website:
(because of my rep I cannot post more than two links, however my IMPORTHTML function has the above link as input in both functions)
into a Google Sheet. Please help. Any feedback is much appreciated... thanks!
The best advice is to find another web table you can import. If you do "view source" on the page, you will find that the table content is dynamically populated from a variable named NF_DATA, which is why IMPORTHTML only sees the template placeholders.
Otherwise, you need to create a document script to extract the data you want:
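function this_is_test() {
  var response = UrlFetchApp.fetch("");
  var raw_content = response.getContentText();
  // match everything from "daily_projections":[ up to (not including) the next ]
  var re = new RegExp('"daily_projections":\\[[^\\]]+', 'i');
  var proj = raw_content.match(re);
  return proj;  // array whose first element is the matched text, or null if nothing matched
}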
It will extract all text in-between "daily_projections":[ and ], which is (as of today):
Note that even this is not complete. You need to somehow map nba_player_id to the appropriate name. In any case, a lot of coding will be involved...