Collecting usage data about the internet

Where do sources such as Alexa, Compete, etc. collect their data to build internet statistics such as top-website rankings and lists of the countries a website's visitors come from?

These sites gather raw data by tracking the behavior of users who have opted into data collection, then apply some statistical extrapolation to turn that sample into traffic estimates.
In Alexa's case, the data comes from users of the Alexa toolbar. Compete does the same, but with a larger and more broadly defined "panel" of users.

Related

Need advice: How to share a potentially large report with remote users?

I am asking for advice on possibly better solutions for the part of the project I'm working on. I'll first give some background and then my current thoughts.
Background
Our clients can use my company's products to generate potentially large data sets for use in their industry. When the data sets are generated, the clients will file a processing request with us.
We want to send the clients a summary email which contains some statistical charts as well as sampling points from the data sets so they can do some initial quality control work. If the data sets are of bad quality, they don't need to file any request.
One problem is that the charts and sampling points can be too large to send in an email. The charts and the sampling points we want to include in the emails are pictures. Although we can use a low-quality format such as JPEG to save space, we cannot control how many data sets will be included in the summary email, so the total size could still exceed the normal email size limit.
In terms of technologies, we are mainly developing in Python on Ubuntu 14.04.
Goals of the Solution
In general, we want to present something report-like to the clients so they can do some initial QA. The report may contain external links but does not need to be very interactive. In other words, a static report should be fine.
We want to reduce the steps or things that our clients must do to read the report. For example, if the report can simply be an email, the user only needs to 1) log in and 2) open the email. If they use an email client, they can skip step 1) and just open and start reading.
We also want to minimize the burden of maintaining extra user accounts, for both us and our clients. For example, a solution that requires us to register a new user account is still acceptable, but would not rank very high.
Security is important because our clients don't want their reports to be read by unauthorized third parties.
We want the process automated. The solution should provide a programming interface so that we can automate the report sending/sharing process.
Performance is NOT a critical issue. Our user base is not large - at most in the hundreds, I think. They also don't generate data that frequently, at most once a week. We don't need real-time responses; even a delay of a few hours is acceptable.
My Current Thoughts of Solution
Possible solution #1: In-house web service. I can set up a server machine and develop our own web service. We put the report into our database, and the clients can then query it over the Internet.
Possible solution #2: Amazon Web Services. AWS is quite mature, but I'm not sure whether it would be expensive, because so far we just want to share a report with our remote clients, which doesn't seem like a big enough job to justify AWS.
Possible solution #3: Google Drive. I know Google Drive provides an API for uploading and sharing programmatically, but I think we would need to register a dedicated Google account to use it.
Any better solutions??
You could possibly use AWS S3 and CloudFront. Files can easily be loaded into S3 using the AWS SDKs and APIs. You can then use the API to generate secure links to the files that are only valid for a specific time and, optionally, only from a specific IP.
Files on S3 can also be automatically cleaned up after a specific time if needed using lifecycle rules.
Storage and transfer prices are fairly cheap with AWS, and remember that the S3 storage cost indicated is per month, so if you only keep an object loaded for a few days then you only pay for those few days.
S3: http://aws.amazon.com/s3/pricing
Cloudfront: https://aws.amazon.com/cloudfront/pricing/
Here's a list of the SDKs for AWS:
https://aws.amazon.com/tools/#sdk
Or you can use their command-line tools for Windows batch or PowerShell scripting:
https://aws.amazon.com/tools/#cli
Here's some info on how the private content URLs are created:
http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/PrivateContent.html
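For the time-limited links, S3 presigned URLs may already be enough; restricting by IP as well would require CloudFront signed URLs with a custom policy. Here is a rough sketch using boto3 (the Python SDK, since the poster is developing in Python); the bucket name, object key and expiry below are placeholders, not anything prescribed by AWS.

```python
import boto3

s3 = boto3.client("s3")

BUCKET = "example-report-bucket"        # placeholder bucket name
key = "reports/client-summary.jpg"      # placeholder object key

# Upload the generated chart/sample image.
s3.upload_file("summary.jpg", BUCKET, key)

# Create a time-limited download link (here: valid for 7 days).
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": BUCKET, "Key": key},
    ExpiresIn=7 * 24 * 3600,
)
print(url)  # include this link in the summary e-mail
```

The lifecycle rules mentioned above can then expire objects under the reports prefix after a set number of days, so old reports clean themselves up.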
I would suggest building this service using a mix of your #1 and #2 options. You can do the processing yourselves and leverage AWS S3 for transferring the data, which is quite cheap.
Example: 100 GB of storage costs approximately $3 per month.
AWS S3 is also beneficial because you are covered in case of a disaster in your local environment: your data will remain safe in S3.
For security, you can leverage data encryption and signed URLs in AWS S3.
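If you go this route, enabling server-side encryption is a one-line addition at upload time. A minimal boto3 sketch, again with placeholder names; SSE-S3 ("AES256") uses Amazon-managed keys, with KMS-managed keys as an alternative.

```python
import boto3

s3 = boto3.client("s3")

with open("summary.jpg", "rb") as f:
    s3.put_object(
        Bucket="example-report-bucket",       # placeholder bucket name
        Key="reports/client-summary.jpg",     # placeholder object key
        Body=f,
        ServerSideEncryption="AES256",        # SSE-S3: encrypted at rest
    )
```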

Using Firebase for chat applications: does it scale?

Given that Firebase supports realtime push for web, Android and iOS, I was tempted to try it out as a pub/sub-style, push-based system. Does anyone know whether Firebase works and scales for a large user base? For example, with a million+ users holding chat conversations and accessing that data simultaneously, what would the response time be as the user count grows?
Firebase's free and basic plans limit connections to 100, and their higher-end plans limit connections to 10,000. That's done to prevent abuse.
From the Firebase website in case it was overlooked:
Those restrictions can be lifted though, and their Enterprise solutions can scale to millions of connections and terabytes of data
I cannot speak from personal experience with a million users, but given the performance we have seen, it should scale to handle that.
Based on the scope of your project, I would engage them directly at support@firebase.com with your specific requirements.

Is there a service or software for metering data downloads?

We have a range of web applications here that allow users to download selected data from a number of databases and online services, mainly environmental information. We can track users visiting web pages using tools like Piwik or Google Analytics. We also want to track the amount of resource or data that they use, possibly also applying limits to record downloads.
If this were a single-DB system, we could track rows delivered within the database. However, here we have an SOA with a range of sources and sinks. What I envisage is a service that other systems can message to register or track the amount of a resource used,
e.g. "User Andrew was sent 125 MB of water quality data."
The central data metering service tracks usage messages from a variety of sources, produces reports and where appropriate applies caps or billing limits.
This service might be expanded to include processing as well as data download.
I would consider this to be a not unusual requirement but I can't find much in the way of existing software for it - perhaps because I am not using the correct terminology.
So my questions:
What would you call this service - what keywords will lead me to existing systems?
What solutions already exist in this area - in particular FOSS or cloud based systems?
Could something like Google Analytics be persuaded to operate in this fashion?
It would be possible to do this with the Measurement Protocol from Google Universal Analytics, in conjunction with the User ID feature in Analytics and one or more custom dimensions.
The Measurement Protocol is a language-agnostic, vaguely REST-like protocol (insofar as you send a bunch of parameters to an endpoint) for sending tracking data to the Google servers.
User ID is a feature for recognizing authenticated users across devices and multiple visits.
If the various parts of your setup send HTTP calls built to the Measurement Protocol and include the user ID to recognize the user, a value for a custom metric for the file size (a metric rather than a dimension if you want sums and averages), and perhaps a custom dimension for the file name, you can send all of this to your Analytics account and build a custom report for downloads.
Note that the User ID is an internal id used to link together visits by the same user from multiple devices - it is not something that shows up in the reports, so it will not let you report on individual users in the Analytics interface (if you want that, you need to include another id as a custom dimension, and you have to check against the Google TOS what kind of id is allowed). You would also need a dedicated data view in GA for sessions with a user ID, which will not show unauthenticated users.
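As a rough illustration, here is what one of your services might send when it delivers a download (Python with the requests library). The tracking ID, user ID and the cd1/cm1 indexes are placeholders: the custom dimension and custom metric must first be created in the Analytics property, and the indexes here have to match what you configure there.

```python
import requests

payload = {
    "v": "1",                     # Measurement Protocol version
    "tid": "UA-XXXXXXX-Y",        # placeholder tracking/property id
    "cid": "555",                 # anonymous client id (required)
    "uid": "andrew",              # authenticated user id
    "t": "event",                 # hit type
    "ec": "data-download",        # event category
    "ea": "water-quality",        # event action: which dataset was delivered
    "cd1": "wq_2015.csv",         # custom dimension: file name (placeholder index)
    "cm1": "125",                 # custom metric: size in MB (placeholder index)
}

requests.post("https://www.google-analytics.com/collect", data=payload)
```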

How to balance a REST API and openness to prevent data stealing

One of our web sites is a typical "advertise your apartment for free" site. Revenues are directly tied to the volume of public usage and the number of listings registered (our marketing department's argument).
On the other side, REST pushes you to maintain a clean API when designing your service (our software department's argument), which amounts to an open invitation for any competitor to steal the data. In this view, the web server becomes almost an intelligent database.
We have clearly identified our problem, but have no idea how to resolve these constraints. Any tips would help?
Throttle the calls to the data-rich endpoints by IP to, say, 1000 per day (or triple what a normal user would use).
If you expose data, it can be stolen. Also think about search features that return large datasets, even if they are driven by JavaScript or forms - I have personally written trawlers that get around these obstacles.
You may also think (if the data is that important) about encrypting it and decrypting it in the client based on keys and authentication sent from the server (but this only raises the bar; it does not remove the ability to steal).
Add a captcha/reCAPTCHA for users who are scanning too quickly or too much.
In short:
As always, only expose the minimum API needed to do the job (attack-surface minimisation)
Log and throttle (a minimal per-IP throttling sketch follows below)
Force sign-in(?). This at least MAY put off some scanners
Use a captcha mechanism for users you think may be bots trawling your data
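To make the "log and throttle" idea concrete, here is a minimal sketch of a per-IP daily counter in Python. The 1000/day limit comes from the suggestion above, while the in-process dictionary and the allow_request helper are purely illustrative; a real deployment would keep the counters in something like Redis so they survive restarts and are shared across workers.

```python
import time
from collections import defaultdict

DAILY_LIMIT = 1000                          # requests per IP per day
_counters = defaultdict(lambda: {"day": None, "count": 0})

def allow_request(ip):
    """Return True while this IP is still under today's request budget."""
    today = time.strftime("%Y-%m-%d")
    entry = _counters[ip]
    if entry["day"] != today:               # first request of a new day: reset
        entry["day"] = today
        entry["count"] = 0
    entry["count"] += 1
    return entry["count"] <= DAILY_LIMIT

# Inside a request handler (framework-specific details omitted):
# if not allow_request(client_ip):
#     return "Too many requests", 429       # or show a captcha instead
```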

Amazon S3 + CloudFront Queries

I am currently building a social-sharing-type app and I have run into a problem.
First off, S3 in my experience is slow, so I need to sync the data across multiple servers around the world to make it faster for users.
So my question is: do I need to create a separate bucket for each country? Amazon has a list of their server locations. So for each user, do I calculate the nearest server and then upload there? How?
Next question: in my app people can subscribe to others and check for their updates. So realistically, this would not create a speed difference. If someone in Singapore uploaded a piece of text and has a subscriber in the United States, it wouldn't be any quicker for that subscriber, because they have to download a piece of text stored all the way over in Singapore.
All of this is confusing me! I personally find S3 very slow, which is why I am using CloudFront.
Any help? Am I misunderstanding the process? Thanks!
Buckets are not per country, they are per region (EU, US, Asia, etc.)
Secondly, you do not have to work out the closest URL to your S3 buckets yourself; that's what CloudFront is for. You just get a single URL for each bucket, and CloudFront will route the user's request to the closest edge location.
PS: In addition, Amazon replicates data uploaded to your bucket across all edge locations transparently.
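To make the single-URL point concrete, here is a small boto3 sketch: you upload to one bucket and always hand clients the CloudFront distribution's domain instead of the S3 endpoint. The bucket name and distribution domain are placeholders for whatever you created in the console.

```python
import boto3

s3 = boto3.client("s3")

BUCKET = "example-app-content"                  # one bucket, one region
CLOUDFRONT_DOMAIN = "d1234abcd.cloudfront.net"  # your distribution's domain

key = "posts/hello.txt"
s3.upload_file("hello.txt", BUCKET, key)

# Clients everywhere fetch through CloudFront, not directly from S3:
url = "https://{0}/{1}".format(CLOUDFRONT_DOMAIN, key)
print(url)
```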
Amazon in no way "automatically" replicates your content out to the edge locations. Instead, your content is copied to a single edge location if (and only if) the content is not there (it could be the first pull, or it could have expired) when a user tries to access it from that edge. It is a pull mechanism, not a push. See the "Download Distributions for HTTP Delivery" section of http://aws.amazon.com/cloudfront/