How to query GitHub file sizes in Google BigQuery?

I need to get size statistics for the files in GitHub's open source repositories.
For example: the number of files smaller than 1 MiB is XXX, or 70% of all files.
I found that the files in [bigquery-public-data.github_repos.contents] are all smaller than 1 MiB (though I don't know why). So I decided to use [githubarchive:month.202005] or another month instead.
But I didn't find a "file size" field in [githubarchive:month.202005]. So I would like to ask how to query file sizes in [githubarchive:month.202005]? Then I could use the method in this to get the results by size.
I am new to BigQuery, and the question may be silly, but I really need a solution. Alternatively, are there statistics or literature I can cite that include size statistics for files on GitHub? The description of [bigquery-public-data.github_repos.contents] does not explain why only files smaller than 1 MiB were selected.

I think this is a misinterpretation: the public table bigquery-public-data.github_repos.contents holds text file data in its content column only for items under 1 MiB on the HEAD branch; for the others you'll find just NULL values:
SELECT id, size, content FROM `bigquery-public-data.github_repos.contents` WHERE size > 1048576 LIMIT 100
Therefore, if I understand your point correctly, you are not limited in analyzing the file inventory in this case.
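To get the kind of breakdown asked about (count and share of files under 1 MiB), a minimal sketch using the google-cloud-bigquery Python client could look like the following; it assumes default credentials and a billing project are configured, and uses the same 1048576-byte threshold as the query above:

from google.cloud import bigquery

client = bigquery.Client()  # assumes default credentials and a billing project

# Count files below 1 MiB and the share they represent of all rows
# in the public contents table (the size column is in bytes).
sql = """
SELECT
  COUNTIF(size < 1048576) AS files_under_1mib,
  COUNT(*) AS total_files,
  ROUND(100 * COUNTIF(size < 1048576) / COUNT(*), 2) AS pct_under_1mib
FROM `bigquery-public-data.github_repos.contents`
"""

for row in client.query(sql).result():
    print(row.files_under_1mib, row.total_files, row.pct_under_1mib)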

Related

How to calculate filtered file size in google cloud storage

I have a folder hierarchy Bucket/folder/year/month/date/files.ext, e.g. 2021/12/31/abc.html and 2022/1/1/file1.html, etc. The folders contain millions of HTML files and images. I only want to calculate the total size filtered by the .html extension, for the years 2019 through 2022, broken down by month and date.
Right now what I'm using is
gsutil du gs://Bucket/folder/*/*/*/*.html | wc -l
I couldn't find any better solution; it takes too long and ends with "the connection to your Google Cloud Shell was lost". The second thing is that I want to delete all HTML files like 2019/1/1/file1.html.
Unfortunately, I think you're already looking at the right answer. GCS doesn't provide any sort of index that'll quickly calculate total file size by file type.
Cloud Shell will time out after some minutes of inactivity, or after 24 total hours, so if you have millions of files and need this to complete, I would suggest starting a small GCE instance and running the command from there, or running gsutil from your own machine.
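If the single gsutil invocation keeps timing out, one alternative sketch (using the google-cloud-storage Python client; the bucket and folder names below are just the placeholders from the question) is to page through the objects yourself and sum the sizes, which can be run from a small GCE instance or your own machine:

from google.cloud import storage

BUCKET = "Bucket"    # placeholder: your actual bucket name
PREFIX = "folder/"   # placeholder: the top-level folder from the question

client = storage.Client()

total_html_bytes = 0
html_count = 0

# list_blobs pages through every object under the prefix; filter client-side on extension.
for blob in client.list_blobs(BUCKET, prefix=PREFIX):
    if blob.name.endswith(".html"):
        total_html_bytes += blob.size
        html_count += 1

print(f"{html_count} .html files, {total_html_bytes / 1024**3:.2f} GiB total")

# To delete the 2019 HTML files instead, narrow the prefix and call delete():
# for blob in client.list_blobs(BUCKET, prefix=PREFIX + "2019/"):
#     if blob.name.endswith(".html"):
#         blob.delete()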

Dataset rows disappear when a recipe is built

I uploaded the dataset into Google Cloud AI storage. Next, I opened the flow in Dataprep and added the dataset to it. When I created the first recipe (without any steps yet), the dataset had approximately half of its original rows, that is, 36,234 instead of 62,948.
I would like to know what could be causing this problem. Is some configuration missing?
Thank you very much in advance
Here are a couple of thoughts...
Data Sampling
Keep in mind that what's shown in the Dataprep editor is typically a sample of the data, not the full data (unless it's very small). If the full file was small enough to load, you should see the "Full Data" label where the sample is typically indicated.
In other cases, what you're actually looking at is a sample, and that will also be indicated there.
It's very beneficial to have an idea of how Dataprep's sampling works if you haven't reviewed the documentation already:
https://cloud.google.com/dataprep/docs/html/Overview-of-Sampling_90112099
Compressed Sources:
Another issue I've noticed occasionally is when loading compressed CSVs. In this case, I've had the interface tell me that I'm looking at the "Full Data", but the number of rows is incorrect. However, any time this has happened, the job does actually process the full number of rows.

Questions after reading the API doc of upload session

I'm a bit confused after reading this doc.
The doc says:
The fragments of the file must be uploaded sequentially in order. Uploading fragments out of order will result in an error.
Does that mean that, for one file divided into fragments #1~10 in order, I can only upload fragment 2 after I have finished uploading fragment 1? If so, why is it possible to have multiple nextExpectedRanges? I mean, if you upload fragments one by one, you can be sure that the previous fragments have already been uploaded.
According to the doc, the byte range size has to be a multiple of 320 KB. Does that imply that the total file size also has to be a multiple of 320 KB?
There are currently some limitations that necessitate this sequencing requirement; however, the long-term goal is to remove it. The API reflects this by supporting multiple nextExpectedRanges, but it does not currently leverage that capability.
No, multiples of 320 KiB are just the ideal fragment size. You can choose other sizes, and you can mix them. So for your scenario you could use 320 KiB chunks for everything except the last one, which would be whatever size is needed to reach the overall size of your file.
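To make the fragment math concrete, here is a small sketch (plain Python, no API calls, sizes as discussed above) that computes the sequential byte ranges for 320 KiB fragments with a shorter final fragment, which also shows why the total file size does not need to be a multiple of 320 KiB:

FRAGMENT_SIZE = 320 * 1024  # 320 KiB, the recommended multiple

def fragment_ranges(total_size, fragment_size=FRAGMENT_SIZE):
    """Yield (start, end_inclusive) byte ranges covering the file in order."""
    start = 0
    while start < total_size:
        end = min(start + fragment_size, total_size) - 1
        yield start, end
        start = end + 1

total = 1_000_000  # example: a file size that is not a multiple of 320 KiB
for start, end in fragment_ranges(total):
    # range-based upload APIs typically express this as "bytes {start}-{end}/{total}"
    print(f"bytes {start}-{end}/{total}")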

AEM6 repository datastore path - CRXDE asset path

How can I find the CRXDE path of a certain binary asset stored in the datastore? e.g.
/crx-quickstart/repository/repository/datastore/ff/ff/ba/ffffbaad37ed026b4e3b22ecba70088f670d36e7
Background of the question:
We create a daily backup of the AEM6 instance. This backup grew from one day to the next from around 65 GB to around 140 GB, more than double the size!
Audit logs didn't tell us anything about the change.
We compared the /crx-quickstart/repository/repository/datastore folder of both days and found that some of the datastore folders had increased by more than a GB.
The size differences in all these folders together sum up more or less to this 85 GB diff, so that matches (with some minor diffs in folders outside of the datastore).
Now we'd like to find out which asset relates to which folder in the datastore to know which assets caused the size increase.
Another option would be to have a good XPATH or SQL query to get all large assets that changed during that time.
But we cannot rely on the creation date, because the assets were probably uploaded to the system via a package, in which case they keep the original creation date from the source environment (to which we don't have access).
However, in the package manager there are no packages dated around the time of this huge backup size increase.
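(For the folder-comparison step described above, a minimal sketch that lists the largest datastore files written in a given time window could look like this; the datastore path is the one shown earlier, the dates are placeholders, and the file names in the datastore are content hashes of the stored binaries:)

import os
from datetime import datetime

# Placeholders: adjust the path to your instance and the dates to the day the backup grew.
DATASTORE = "/crx-quickstart/repository/repository/datastore"
WINDOW_START = datetime(2015, 1, 1).timestamp()
WINDOW_END = datetime(2015, 1, 2).timestamp()

candidates = []
for root, _dirs, files in os.walk(DATASTORE):
    for name in files:
        path = os.path.join(root, name)
        st = os.stat(path)
        if WINDOW_START <= st.st_mtime <= WINDOW_END:
            candidates.append((st.st_size, path))

# Largest datastore records written in the window.
for size, path in sorted(candidates, reverse=True)[:50]:
    print(f"{size / 1024**2:8.1f} MiB  {path}")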

Can I tell Lucene.net not to create files over a certain size?

I have an upload limit on file sizes of 100MB, but my index files are coming out larger than that. Is there an option somewhere I can set so that once a specific file size is reached, the index will spill over into a secondary file?
Sort of. From the Java Lucene FAQ:
Is there a way to limit the size of an index?
This question is sometimes brought up because of the 2GB file size limit of some 32-bit operating systems.
This is a slightly modified answer from Doug Cutting:
The easiest thing is to use IndexWriter.setMaxMergeDocs(). If, for instance, you hit the 2GB limit at 8M documents, set maxMergeDocs to 7M. That will keep Lucene from trying to merge an index that won't fit in your filesystem. It will actually effectively round this down to the next lower power of Index.mergeFactor.
So with the default mergeFactor set to 10 and maxMergeDocs set to 7M, Lucene will generate a series of 1M document indexes, since merging 10 of these would exceed the maximum.
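To make the rounding behaviour concrete, here is a small arithmetic sketch (not Lucene code, just the calculation the FAQ describes) using the default mergeFactor of 10 and a maxMergeDocs of 7M:

merge_factor = 10            # Lucene's default mergeFactor
max_merge_docs = 7_000_000   # value suggested in the FAQ example

# Largest power of merge_factor that does not exceed max_merge_docs.
effective = 1
while effective * merge_factor <= max_merge_docs:
    effective *= merge_factor

# effective == 1_000_000: Lucene ends up building ~1M-document indexes,
# since merging 10 of them (10M docs) would exceed the 7M cap.
print(effective)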