R Web Scraping with rvest: forms and submit_form

I am new to R and not very knowledgeable about HTML, XML, etc. I am trying to scrape a site that requires input from a drop-down. It's for an academic paper using text and sentiment analysis on the press releases of members of Congress. Not a programmer, lol, so be gentle!
memberUrl = 'https://grijalva.house.gov/press-releases/'
session <- html_session(memberUrl)
forms <- html_form(session)
yearForm <- forms[[4]]
#--- so far so good (I think) -- I have successfully scraped sites without drop-downs
#--- but here is where I get confused; I can't find a good tutorial on forms and submit_form
set_values(yearForm, ???)  #--- stuck on how to use set_values
submit_form(session, yearForm, ???)  #--- and here
Thanks! Jim

submit_form didn't work, maybe because that form uses JavaScript to submit. Here is the solution:
library(rvest)
memberUrl = 'https://grijalva.house.gov/press-releases/'
session <- html_session(memberUrl)
session <- rvest:::request_POST(
  session,
  memberUrl,
  body = list(
    getNewsByyear = "2018"  # change the value here; 'getNewsByyear' is the name of the drop-down
  )
)
titles <- read_html(session) %>%
  html_nodes("ul > li > h3") %>%
  html_text()
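For reference, on forms that post normally (without JavaScript), the pattern the question was reaching for would look roughly like this; a sketch only, using the 'getNewsByyear' field name found above, and note it did not work on this particular JS-driven form:
library(rvest)
# classic (pre-1.0) rvest form pattern, shown for illustration
session <- html_session(memberUrl)
yearForm <- html_form(session)[[4]]
yearForm <- set_values(yearForm, getNewsByyear = "2018")  # fill in the drop-down value
session <- submit_form(session, yearForm)                 # post the form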

Related

Function corpus in quanteda doesn't work because of kwic objects

First of all, I'm working on a big-data project which consists of analyzing some press URLs to detect the most popular topics. My topic is football (Mbappe's contract) and I collected 180 URLs from Marca, a Spanish mass-media outlet, in a .txt file.
When I try to create a document matrix with the corpus() function from the quanteda package, I get this: Error: corpus() only works on character, corpus, Corpus, data.frame, kwic objects.
Some URLs seem to contain an object (maybe a video, adverts, ...) that keeps me from working with just the text; I think it's because, when inspecting the HTML, the body div selector automatically picks these objects up.
Here is the code I use to read them:
library(rvest)
library(stringr)

url_marca <- read.table("mbappe.txt", stringsAsFactors = FALSE)$V1
get_marca_text <- function(url) {
  url %>%
    read_html() %>%
    html_nodes("div.ue-c-article__body") %>%
    html_text() %>%
    str_replace_all("[\r\n]", "")
}
text_marca_mbappe <- sapply(url_marca, get_marca_text)
Does anyone know if it is because of a mistake in html_nodes when inspecting the URL, or is it something different?
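One likely cause, offered here as an untested sketch rather than a confirmed fix: sapply() returns a list when the pages yield different numbers of text nodes, and corpus() rejects a plain list. Collapsing each page to a single string gives corpus() the character vector it expects:
library(quanteda)
# untested sketch: one document per URL, pasting each page's text nodes together
texts <- vapply(url_marca,
                function(u) paste(get_marca_text(u), collapse = " "),
                character(1))
corp <- corpus(texts)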

Issue with DatePicker - RSelenium

I'm scraping publicly available data for academic research. The website I'm pulling the information from has a really annoying datepicker, though. I'm not sure if they implemented this to deter private companies from scraping criminal data, but it seems pretty dumb.
Here's the url.
I can bypass the Captcha with my institutional credentials, FYI.
You can see code - minus the login information - below:
# Miami Scraper
# Close any driver left over from a previous run, then start clean
remDr$close()
rm(rD)
gc()
rm(list = ls())
setwd("~/Desktop/Miami Scrape")

library(httr)
library(rvest)
library(zoo)
library(anytime)
library(lubridate)
library(dplyr)
library(RSelenium)

url <- "https://www2.miami-dadeclerk.com/PremierServices/login.aspx"
rD <- rsDriver(verbose = FALSE, port = 4444L, browser = "firefox")
remDr <- rD$client
remDr$navigate(url)
remDr$navigate(url)
# Click the logging-in option
# Log-in stuff happens here
url2 <- "https://www2.miami-dadeclerk.com/cjis/casesearch.aspx"
remDr$navigate(url2)
# Read in the sheets; start with a handful of cases
date <- c("02", "01", "01")
sequence <- c("030686", "027910", "014707")

seqbar <- remDr$findElement("id", "txtCaseNo3")
seqbar$sendKeysToElement(list(sequence[1]))

type <- remDr$findElement("id", "ddCaseType")
type$clickElement()
type$sendKeysToElement(list("F", "\n"))

yearbar <- remDr$findElement("id", "txtCaseNo2")
yearbar$clearElement()
remDr$setTimeout(type = "implicit", milliseconds = 2000)  # slow the element lookups down a little
yearbar$sendKeysToElement(list(date[1]))
The datepicker keeps defaulting to "19", but not in any systematic way. I'm only beginning to develop the code, but I notice that if I use the same case information for two searches in a row, it regularly switches from "02" to "19". If I switch to another case, it may not work either. I'm not sure how to deal with this datepicker; any help would be greatly appreciated.
I've tried a couple of things. As you can see, I've tried clearing out the default and slowing my code down, too. That doesn't seem to work.
One last note: if you run the code line by line it works, but executing it all at once won't run properly.
I can't test with R, as I can't seem to get RSelenium set up, but changing the value attribute of the year input box seems to work. In R it looks like there are two ways to do that.
Can't test, but something like:
year <- '02'

# method 1: built-in method, which executes JS under the hood
remDr$findElement('id', 'txtCaseNo2')$setElementAttribute('value', year)

# method 2: run the JS directly
js <- paste0("document.querySelector('#txtCaseNo2').value='", year, "';")
remDr$executeScript(js)
Anyway, might be enough to get you on track for a solution.
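If method 2 works, a rough, untested sketch of looping over the case vectors from the question might look like this ('btnCaseSearch' is the search-button id taken from the Python version below):
for (i in seq_along(sequence)) {
  # set the year box via JS so the datepicker can't overwrite it
  js <- paste0("document.querySelector('#txtCaseNo2').value='", date[i], "';")
  remDr$executeScript(js)
  seqbar <- remDr$findElement("id", "txtCaseNo3")
  seqbar$clearElement()
  seqbar$sendKeysToElement(list(sequence[i]))
  remDr$findElement("css selector", "[value=F]")$clickElement()   # case type F
  remDr$findElement("id", "btnCaseSearch")$clickElement()         # id from the Python version
  Sys.sleep(2)  # crude pause so the results page can load
}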
I tested similar versions with Python successfully:
from selenium import webdriver
d = webdriver.Chrome()
d.get('https://www2.miami-dadeclerk.com/cjis/casesearch.aspx?AspxAutoDetectCookieSupport=1')
case_nums = ["030686"]
year = '02'
d.execute_script("document.querySelector('#txtCaseNo2').value='" + year + "';")
# d.execute_script("arguments[0].value = '02';", d.find_element_by_id('txtCaseNo2'))
d.find_element_by_id('txtCaseNo3').send_keys(case_nums[0])
d.find_element_by_css_selector('[value=F]').click()
captcha = input()
d.find_element_by_id('CaptchaCodeTextBox').send_keys(captcha)
d.find_element_by_id('btnCaseSearch').click()

Using a date input via dateInput as part of a filename

I have a Shiny UI which allows the user to select a date via a dateInput box. The output from this will be backed up daily, so I would like to use that date, e.g. 20181224, as part of the filename.
library(shiny)
library(shinyFiles)

ui <- fluidPage(
  sidebarPanel(
    dateInput("COBInput", "Select a Date", value = Sys.Date())
  )
)

server <- function(input, output, session) {
  COB <- reactive(as.Date(input$COBInput, format = "%Y-%m-%d"))
  COB2 <- paste(
    "Test",
    as.character(
      format(input$COBInput, format = "%Y-%m-%d", '%Y')
    )
  )
}

shinyApp(ui, server)
The error that I got:
Listening on http://127.0.0.1:4973
Warning: Error in .getReactiveEnvironment()$currentContext: Operation not allowed
without an active reactive context. (You tried to do something that can only be
done from inside a reactive expression or observer.)
54: stop
53: .getReactiveEnvironment()$currentContext
52: .subset2(x, "impl")$get
51: $.reactivevalues
47: server [N:/AdHoc Query/R/FFVA/DateInputTest/ShinyApp.R#42]
Error in .getReactiveEnvironment()$currentContext() :
Operation not allowed without an active reactive context. (You tried to do something that can only be done from inside a reactive expression or observer.)
I would expect that for each day I could save a file with a name like "Daily20181224", "Daily20181221", etc.
I was not exactly clear on the requirements, but I tried using textOutput, which should give you an idea of how to generate the filename.
library(shiny)
library(shinyFiles)

ui <- fluidPage(
  sidebarPanel(
    dateInput("COBInput", "Select a Date", value = Sys.Date()),
    textOutput("filename")
  )
)

server <- function(input, output, session) {
  output$filename <- renderText({
    input_date <- input$COBInput
    year  <- format(input_date, '%Y')
    month <- format(input_date, '%m')  # keep as character so leading zeros survive
    day   <- format(input_date, '%d')
    paste0("Daily", year, month, day)
  })
}

shinyApp(ui, server)
I think you can generate the filenames now.
One thing I'd like to say about shinyFiles: as you're probably aware, after deployment it can only be used for server-side file browsing.
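As a follow-up, a minimal sketch of actually saving a file under that name; the save button and the data frame here are hypothetical placeholders, not from the question:
# Inside server(): 'input$save' (a hypothetical actionButton) and 'my_data'
# (a hypothetical data frame) are used only to illustrate building the name
observeEvent(input$save, {
  fname <- paste0("Daily", format(input$COBInput, "%Y%m%d"), ".csv")
  write.csv(my_data, file = fname, row.names = FALSE)
})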

LibreOffice Basic - Associative array

I come from PHP/JS/AS3 and that kind of language. Now I'm learning Basic for LibreOffice and I'm struggling to work out how to get something similar to the associative arrays I used to use in other languages.
What I'm trying to do is to have this kind structure:
2016 => October => afilename.csv
2016 => April => anotherfilename.csv
with the year as the main key, then the month, and then some data.
The more I try to find information, the more confused I get, so if someone could tell me a little about how to organise my data I would be very pleased.
Thanks!
As #Chrono Kitsune said, Python and Java have such features but Basic does not. Here is a Python-UNO example for LibreOffice Writer:
def dict_example():
    files_by_year = {
        2016: {'October': 'afilename.csv',
               'November': 'bfilename.csv'},
        2017: {'April': 'anotherfilename.csv'},
    }
    doc = XSCRIPTCONTEXT.getDocument()
    oVC = doc.getCurrentController().getViewCursor()
    for year in files_by_year:
        for month in files_by_year[year]:
            filename = files_by_year[year][month]
            oVC.getText().insertString(
                oVC, "%s %d: %s\n" % (month, year, filename), False)

g_exportedScripts = dict_example,
Create a file with the code above using a text editor such as Notepad or GEdit. Then place it here.
To run it, open Writer and go to Tools -> Macros -> Run Macro, and find the file under My Macros.
I'm not familiar with LibreOffice (or OpenOffice.org) BASIC or VBA, but I haven't found anything in the documentation for any sort of associative array, hash, or whatever else someone calls it.
However, many modern BASIC dialects allow you to define your own type as a series of fields. Then it's just a matter of using something like
Dim SomeArray({count|range}) As New MyType
I think that's as close as you'll get without leveraging outside libraries. Maybe the Python-UNO bridge would help, since Python has such a feature (dictionaries), but I don't know for certain, and I also don't know how it would affect performance. You might prefer Java to Python for interfacing with UNO, and that's okay too: there's the java.util.HashMap type. Sorry I can't help more, but the important thing to remember is that BASIC code tends to live up to the meaning of the word "basic" in English without external assistance.
This question was asked long ago, but the existing answers are only half correct.
It is true that LibreOffice Basic does not have a native associative array type. But the LibreOffice API provides services. The com.sun.star.container.EnumerableMap service will meet your needs.
Here's an example:
' Create the map
map = com.sun.star.container.EnumerableMap.create("long", "any")

' The values
values = Array( _
    Array(2016, "October", "afilename.csv"), _
    Array(2016, "April", "anotherfilename.csv") _
)

' Fill the map
For i = LBound(values) To UBound(values)
    value = values(i)
    theYear = value(0)
    theMonth = value(1)
    theFile = value(2)
    If map.containsKey(theYear) Then
        map2 = map.get(theYear)
    Else
        map2 = com.sun.star.container.EnumerableMap.create("string", "string")
        map.put(theYear, map2)
    End If
    map2.put(theMonth, theFile)
Next

' Access an element
map.get(2016).get("April") ' anotherfilename.csv
As you can see, the methods are similar to those you would find in more common languages.
Beware: if you experience IllegalTypeException you might have to use CreateUNOValue("<type>", ...) to cast a value into the declared type. (See this very old issue https://bz.apache.org/ooo/show_bug.cgi?id=121192 for an example.)

Download all posts for Group I have Admin rights to using Facebook Graph API

We are trying to retrieve ALL the posts, with associated comments and images, made to our group in the last year. I've tried using the Graph API to do this, but pagination means I have to get the data, copy the "next" link, and run the query again. Unfortunately, this means a LOT of work, since there are over 2 million posts in the group.
Does ANYONE know of a way to do this without spending a few days clicking? Also consider that the group has 4000+ members and is growing every day, with, on average, about 1000 posts a DAY at the moment.
For the curious, the PLAN is to cull the herd...
I am HOPELESS at programming and have recently started learning Python...
I did it like this; you'll probably have to iterate through all the posts until data is empty. Note this is a Python 2.x version.
from facepy import GraphAPI
import json
group_id = "YOUR_GROUP_ID"
access_token = "YOUR_ACCESS_TOKEN"
graph = GraphAPI(access_token)
# https://facepy.readthedocs.org/en/latest/usage/graph-api.html
data = graph.get(group_id + "/feed", page=False, retry=3, limit=800)
with open('content.json', 'w') as outfile:
    json.dump(data, outfile, indent=4)
I've just found and used #dfdfdf's solution, which is great!
You can generalize it to download multiple pages of a feed, rather than just the first one, like so:
from facepy import GraphAPI
import json
group_id = "\YOUR_GROUP_ID"
access_token = "YOUR_ACCESS_TOKEN"
graph = GraphAPI(access_token)
pages = graph.get(group_id + "/feed", page=True, retry=3, limit=1000)
i = 0
for p in pages:
    print 'Downloading page', i
    with open('content%i.json' % i, 'w') as outfile:
        json.dump(p, outfile, indent=4)
    i += 1