I need to run a very large search on GitHub to gather a statistic for my thesis.
For example, I need to explore a large number of Android projects on GitHub, but the site limits search results to 1000 (e.g. https://github.com/search?l=java&q=onCreate&ref=searchresults&type=Code&utf8=%E2%9C%93). I also tried the Java GitHub API, using the method GitHubClient.searchRepositories() from the library org.eclipse.egit.github.core.client.GitHubClient, but the number of results is limited there as well.
Does anyone know how to get all results?
The Search API will return up to 1000 results per query (including pagination), as documented here:
https://developer.github.com/v3/search/#about-the-search-api
However, there's a neat trick you could use to fetch more than 1000 results when executing a repository search: split the search into segments by the date the repositories were created. For example, you could first search for repositories created in the first week of October 2013, then the second week, then September, and so on.
Because you would be restricting the search to a narrow period, you will probably get fewer than 1000 results and will therefore be able to collect all of them. If you notice that more than 1000 results are returned for a period, narrow the period further until all of its results fit under the cap.
https://help.github.com/articles/searching-repositories/#search-based-on-when-a-repository-was-created-or-last-updated
You should be able to automate this via the API.
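As a rough illustration, here is a minimal Python sketch of that date-slicing approach against the repository search endpoint. The query string and date windows are placeholders to adapt to your own search, and you should authenticate to get higher rate limits:

import requests

# Sketch of the date-slicing approach (not production code).
# The query and windows are illustrative; unauthenticated search
# requests are heavily rate-limited, so add a token for real runs.
SEARCH_URL = "https://api.github.com/search/repositories"

windows = [
    ("2013-09-01", "2013-09-30"),
    ("2013-10-01", "2013-10-07"),
    ("2013-10-08", "2013-10-14"),
]

all_repos = []
for start, end in windows:
    query = f"android created:{start}..{end}"
    page = 1
    while True:
        resp = requests.get(
            SEARCH_URL,
            params={"q": query, "per_page": 100, "page": page},
        )
        resp.raise_for_status()
        data = resp.json()
        if data["total_count"] > 1000:
            # This window still exceeds the cap: split it into
            # smaller date ranges and search those instead.
            print(f"{start}..{end}: {data['total_count']} results; narrow it")
            break
        if not data["items"]:
            break
        all_repos.extend(data["items"])
        page += 1

print(f"collected {len(all_repos)} repositories")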
If you are searching for all files on GitHub with filename:your-file-name, you can also slice the search with the size qualifier.
For example, if you are looking for all files named test.rb on GitHub, the API may report more than 11M results, but you can only retrieve 1,000 of them because the GitHub Search API provides up to 1,000 results for each search. A URL like https://api.github.com/search/code?q=filename:test.rb+size:1000..1500 lets you slice your search by changing the size range.
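In the same spirit as the date slicing above, a Python sketch of that size slicing might look like this. The file name and size buckets are placeholders, and note that the code search endpoint requires an authenticated request, so the token is assumed:

import requests

# Sketch: slice a code search by file size so each bucket stays
# under the 1000-result cap. Size buckets are placeholders.
SEARCH_URL = "https://api.github.com/search/code"
HEADERS = {"Authorization": "token YOUR_GITHUB_TOKEN"}  # placeholder

size_buckets = ["0..500", "500..1000", "1000..1500", ">1500"]

total = 0
for bucket in size_buckets:
    resp = requests.get(
        SEARCH_URL,
        params={"q": f"filename:test.rb size:{bucket}", "per_page": 100},
        headers=HEADERS,
    )
    resp.raise_for_status()
    count = resp.json()["total_count"]
    total += count
    print(f"size {bucket}: {count} files")

print(f"total across buckets: {total}")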
I have been searching the GitHub documentation as well as the pygithub documentation for a way to get, for each user, the lines of code committed and merged into the master branch from a specific date onward. The best I could find so far is the contributions view, which lists a user's committed lines of code, but it gives stats for the life of the project, and I need to filter by a specific date. Is there any way to do this? I appreciate the help.
It looks like you can fairly easily retrieve the list of commits from a specific user in a given date range using the pygithub Repository get_commits method. You can see from the method signature below that you can filter by hash, path, date range, and author.
def get_commits(
    self,
    sha=github.GithubObject.NotSet,
    path=github.GithubObject.NotSet,
    since=github.GithubObject.NotSet,
    until=github.GithubObject.NotSet,
    author=github.GithubObject.NotSet,
):
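For example, a minimal sketch along these lines would sum a user's added and deleted lines on the default branch since a given date, using each commit's stats. The repository name, user login, and token are placeholders:

from datetime import datetime

from github import Github

# Sketch: total one user's added/deleted lines since a given date.
# Repo name, login, and access token are placeholders.
gh = Github("YOUR_ACCESS_TOKEN")
repo = gh.get_repo("owner/repo")

since = datetime(2020, 1, 1)
added = deleted = 0
for commit in repo.get_commits(since=since, author="some-user"):
    added += commit.stats.additions
    deleted += commit.stats.deletions

print(f"+{added} / -{deleted} lines since {since:%Y-%m-%d}")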
I'm querying the GA Reporting API v4 to get some metrics for AdWords keywords.
As the dimension I use:
ga:keyword
As metrics I use:
ga:adClicks,
ga:adCost,
ga:CPC,
ga:sessions,
ga:bounceRate,
ga:pageviewsPerSession,
ga:goalConversionRateAll,
ga:transactions,
ga:transactionRevenue
When I compare the results pulled from the API with the results I get in the Google Analytics UI, certain metrics for some keywords show tiny differences.
I got the same result when I tried GA API v3.
What is the reason?
Why are some returned metrics for keywords fully identical to the results in the UI, while others are not?
I tried various date ranges (1 day, a week, a month), but in all cases I got small differences in some metrics for certain keywords.
(Screenshot omitted: differing metric values were marked in red, identical values in green.)
Problem: The reason for the discrepancy is that you are calling two different reports.
Report 1) UI Report.
As you have seen, this report is made up of two parts: the first is Clicks, Cost, and CPC, which come from the Google AdWords API; the other metrics (sessions, bounce rate, etc.) come from Google Analytics.
Because you are going into AdWords > Keywords, you are actually setting a filter to select only AdWords traffic.
Report 2) Custom Report.
This report is pulling the keywords dimension without any filters. This means that the report will also have data for organic keywords, and any UTM_term parameters set.
Because sessions from organic keywords have no AdWords data, the first three columns will be the same; however, the Google Analytics-specific columns will show variation in the metrics.
Solution:
To get your reports to match, you need to add a filter to your API request, such as ga:adwordsCustomerID, or ga:source=google combined with ga:medium=cpc.
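As a minimal sketch (assuming the google-api-python-client v4 reporting client; the view ID, date range, and credentials object are placeholders), the source/medium filter could be added like this:

from googleapiclient.discovery import build

credentials = ...  # placeholder: your OAuth2 / service-account credentials

analytics = build("analyticsreporting", "v4", credentials=credentials)

body = {
    "reportRequests": [{
        "viewId": "VIEW_ID",  # placeholder
        "dateRanges": [{"startDate": "2017-01-01", "endDate": "2017-01-31"}],
        "dimensions": [{"name": "ga:keyword"}],
        "metrics": [
            {"expression": "ga:adClicks"},
            {"expression": "ga:sessions"},
            {"expression": "ga:bounceRate"},
        ],
        # Mirror the UI's implicit restriction to AdWords traffic.
        "dimensionFilterClauses": [{
            "operator": "AND",
            "filters": [
                {"dimensionName": "ga:source",
                 "operator": "EXACT",
                 "expressions": ["google"]},
                {"dimensionName": "ga:medium",
                 "operator": "EXACT",
                 "expressions": ["cpc"]},
            ],
        }],
    }]
}

response = analytics.reports().batchGet(body=body).execute()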
For PowerApps, what data sources, other than SharePoint lists, are accessible via PowerShell?
There are actually two issues that I am dealing with. The first is dynamic updating and the second is the 500-item limit that SharePoint lists are subject to.
I need to dynamically update my data source, which I am currently doing with PowerShell. My data source is not static, and updating records by hand is time-consuming and error-prone. The driving force behind my question is that the SharePoint list view threshold is 5,000 records, but you are limited to 500 visible and searchable records when using SharePoint lists in the gallery view, and my data source contains more than 500 but fewer than 1,000 records. Any items beyond the 500th record that should match the filter criteria will not be found. So SharePoint lists are not an option for me until that limitation is remedied.
Reference: https://powerapps.microsoft.com/en-us/tutorials/function-filter-lookup/
To your first question, PowerShell can be used for almost anything on the Microsoft stack. You could use SQL Server, Dynamics 365, SharePoint, Azure, and in the future there will be an SDK for the Common Data Service. There are a lot of connectors, and PowerShell can work with a good majority of them.
Note that working with these data structures through PowerShell is independent of PowerApps: PowerApps just takes whatever data the connector gives it. If something updates the data in the background (PowerShell, a cron job, etc.), you can use a Timer control together with a Refresh function on your data source (e.g. setting the timer's OnTimerEnd to Refresh(YourDataSource)) to update the list every ~5-20 seconds.
To your second question, about SharePoint: an article on working with large lists came out around the time you asked this. I wouldn't say it completely answers your question, but it seems to state that using the "Filter" function on basic column types could work for you:
...if you’d like to filter the set of items that you are showing in the gallery control, you will make use of a “Filter” expression, rather than the “Search” expression, which is the default that existing apps used. With our changes, SharePoint connector now supports “equals” type of queries on columns that support filtering (Single line of text, choice, numbers, dates and people), so make sure that the columns and the expressions you use are supported and watch for the same warning to avoid reverting back to the top 500 items.
It also notes that if you want to pull from a list larger than the 5k threshold, you need to use indexed columns. I have not fully tested this yet, but it seems this could potentially solve your problem.
I am trying to get the current weather conditions for a certain location, but for some reason I always get the conditions for a semi-random day/time (more or less past week). I am using this query: https://query.yahooapis.com/v1/public/yql?q=select%20item.condition%20from%20weather.forecast%20where%20woeid%20%3D%20639660%20AND%20u%3D%22c%22&format=json&env=store%3A%2F%2Fdatatables.org%2Falltableswithkeys&callback=
Even trying out the example query on their website doesn't work for me. When I click on "Current conditions for San Diego, CA" I get the same, random results. Is there any way to get the current conditions?
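For anyone wanting to reproduce this, here is a minimal Python sketch of the same query; the WOEID 639660 and Celsius units are taken from the URL above, and the response path assumes the documented YQL JSON envelope:

import requests

# Sketch: issue the YQL weather query from the question and print
# the current condition block.
yql = (
    'select item.condition from weather.forecast '
    'where woeid = 639660 and u = "c"'
)
resp = requests.get(
    "https://query.yahooapis.com/v1/public/yql",
    params={
        "q": yql,
        "format": "json",
        "env": "store://datatables.org/alltableswithkeys",
    },
)
resp.raise_for_status()
condition = resp.json()["query"]["results"]["channel"]["item"]["condition"]
print(condition)  # e.g. {"code": ..., "date": ..., "temp": ..., "text": ...}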
I am seeing the same thing, so it has to be on Yahoo's side. Unfortunately, it looks like their support page is down too; it returned a 404.
I've also noticed random old data (1 day up to 2 weeks back) being transmitted by http://xml.weather.yahoo.com/forecastrss/55364_f.xml. It seems to change about every 30 seconds.
We have a metadata-and-url feed and a content feed in our project. The indexing behaviour of the documents submitted using either feed is completely unpredictable. For the content feed, the documents get removed from the index after a random interval every time. For the metadata-and-url feed, the additional metadata we add is ignored, again randomly. The documents themselves do remain in the index in the latter case; only our custom metadata gets removed. Basically, it looks like the feeds get "forgotten" by the GSA after some time. What could be the cause of this issue, and how do we go about debugging it?
Points to note:
1) Due to unavoidable reasons, our GSA index is always hovering around the license limit (+/- 1000 documents or so). Could this have an effect? Are feeds purged when nearing the license limit? We do have lock = true set in the feed records, though.
2) These fed documents are not linked to from pages and hence (I believe) would have a low page rank. Are feeds automatically purged if they are not linked to from pages?
3) Our follow patterns include the fed documents.
4) We do not use action=delete with the same documents, so that possibility is ruled out. Also, for the content feed we always post all the documents, so they are not removed through feeds.
When you hit the license limit, the GSA will start dropping documents from the index, so I'd say that's definitely your problem.