How can I see all characters in a Unicode category?

I've read the documentation and can't find any examples.
http://golang.org/pkg/unicode/#IsPunct
Is there a place in the documentation that explicitly lists all characters in these categories? I'd like to see what characters are contained in category P or category M.

It's not in the documentation, but you can still read the source code. The categories you're talking about are defined in this file: http://golang.org/src/pkg/unicode/tables.go
For example, the P category is defined this way:
var _P = &RangeTable{
    R16: []Range16{
        {0x0021, 0x0023, 1},
        {0x0025, 0x002a, 1},
        {0x002c, 0x002f, 1},
        {0x003a, 0x003b, 1},
        {0x003f, 0x0040, 1},
        {0x005b, 0x005d, 1},
        {0x005f, 0x007b, 28},
        ...
        {0xff5d, 0xff5f, 2},
        {0xff60, 0xff65, 1},
    },
    R32: []Range32{
        {0x10100, 0x10102, 1},
        {0x1039f, 0x103d0, 49},
        {0x10857, 0x1091f, 200},
        ...
        {0x12470, 0x12473, 1},
    },
    LatinOffset: 11,
}
And here is a simple way to print all of them:
var p = unicode.Punct.R16
for _, r := range p {
    for c := r.Lo; c <= r.Hi; c += r.Stride {
        fmt.Print(string(rune(c)))
    }
}
Note that this walks only the 16-bit ranges; iterate over unicode.Punct.R32 the same way to cover the rest.

There are a number of web sites that present an interface to the Unicode character database. For example see the “Punctuation, ...” categories at http://www.fileformat.info/info/unicode/category/.
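If you just want to eyeball the members of a category, any language with a Unicode database will do. For example, here is a quick cross-check using Python's standard unicodedata module (shown purely as a convenient way to enumerate the same data the Go tables encode):

```python
import sys
import unicodedata

# Collect every code point whose general category starts with "P"
# (Pc, Pd, Ps, Pe, Pi, Pf, Po), i.e. everything in category P.
punct = [chr(cp) for cp in range(sys.maxunicode + 1)
         if unicodedata.category(chr(cp)).startswith("P")]

print(len(punct))            # number of punctuation code points in this Unicode version
print("".join(punct[:10]))   # the first few, beginning with '!'
```

The exact count depends on the Unicode version your Python build ships, which may differ from the version Go's tables.go was generated from.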


MongoDB query to compute percentage

I am new to MongoDB and kind of stuck at this query. Any help/guidance will be highly appreciated. I am not able to calculate the percentage in the desired way: something is wrong with my pipeline, so the inputs to the percentage are not computed correctly. Below I provide my unsuccessful attempt along with the desired output.
Single entry in the collection looks like below:
_id : ObjectId("602fb382f060fff5419fd0d1")
time : "2019/05/02 00:00:00"
station_id : 3544
station_name : "Underhill Ave & Pacific St"
station_status : "In Service"
latitude : 40.6804836
longitude : -73.9646795
zipcode : 11238
borough : "Brooklyn"
neighbourhood : "Prospect Heights"
available_bikes : 5
available_docks : 21
The query I am trying to solve is:
Given a station_id (e.g., 522) and a num_hours (e.g., 3) passed as parameters:
- Consider only the measurements where station_status = "In Service".
- Consider only the measurements for that concrete "station_id".
- Compute the percentage of measurements with available_bikes = 0 for each hour of the day (e.g., for the period [8am, 9am) the percentage is 15.06% and for the period [9am, 10am) it is 27.32%).
- Sort the percentage results in decreasing order.
- Return the top "num_hours" documents.
The desired output is:
--- DOCUMENT 0 INFO ---
---------------------------------
hour : 19
percentage : 65.37
total_measurements : 283
zero_bikes_measurements : 185
---------------------------------
--- DOCUMENT 1 INFO ---
---------------------------------
hour : 21
percentage : 64.79
total_measurements : 284
zero_bikes_measurements : 184
---------------------------------
--- DOCUMENT 2 INFO ---
---------------------------------
hour : 00
percentage : 63.73
total_measurements : 284
zero_bikes_measurements : 181
My attempt is:
command_1 = {"$match": {"station_status": "In Service", "station_id": station_id, "available_bikes": 0}}
my_query.append(command_1)
command_2 = {"$group": {"_id": "null", "total_measurements": {"$sum": 1}}}
my_query.append(command_2)
command_3 = {"$project": {"_id": 0,
                          "station_id": 1,
                          "station_status": 1,
                          "hour": {"$substr": ["$time", 11, 2]},
                          "available_bikes": 1,
                          "total_measurements": {"$sum": 1}}}
my_query.append(command_3)
command_4 = {"$group": {"_id": "$hour", "zero_bikes_measurements": {"$sum": 1}}}
my_query.append(command_4)
command_5 = {"$project": {"percent": {
    "$multiply": [{"$divide": ["$total_measurements", "$zero_bikes_measurements"]}, 100]}}}
my_query.append(command_5)
I've taken a look at this and I'm going to offer some sincere advice: don't try to do this in an aggregate query. Just go back to basics, pull the numbers out using find()s, and calculate the percentages in Python.

If you want to persist with an aggregate query, I will point out that your $match command filters on available_bikes equal to zero, so you never have the total number of measurements and therefore can never compute a percentage. Also, once your first $group has run, you "lose" your projection, so at that point in the pipeline you only have total_measurements and nothing else (comment out commands 3 to 5 to see what I mean).
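For what it's worth, here is a sketch of how the pipeline could be restructured so that both counts survive long enough to compute the percentage. Field names follow the question; $cond, $divide, $sort, and $limit are standard aggregation operators, but treat this as an untested outline rather than a verified solution:

```python
def build_zero_bikes_pipeline(station_id, num_hours):
    """Percentage of zero-bike measurements per hour, top num_hours hours."""
    return [
        # Keep every "In Service" measurement for the station. Do NOT
        # filter on available_bikes here, or the totals are lost.
        {"$match": {"station_status": "In Service", "station_id": station_id}},
        # Derive the hour from the "YYYY/MM/DD HH:MM:SS" time string, and
        # flag zero-bike measurements with 1/0 so they can be summed.
        {"$project": {
            "hour": {"$substr": ["$time", 11, 2]},
            "zero": {"$cond": [{"$eq": ["$available_bikes", 0]}, 1, 0]},
        }},
        # One group per hour: total and zero-bike measurement counts.
        {"$group": {
            "_id": "$hour",
            "total_measurements": {"$sum": 1},
            "zero_bikes_measurements": {"$sum": "$zero"},
        }},
        # Both counts now exist side by side, so the percentage is computable.
        {"$project": {
            "_id": 0,
            "hour": "$_id",
            "total_measurements": 1,
            "zero_bikes_measurements": 1,
            "percentage": {"$multiply": [
                {"$divide": ["$zero_bikes_measurements", "$total_measurements"]},
                100,
            ]},
        }},
        {"$sort": {"percentage": -1}},
        {"$limit": num_hours},
    ]

# Usage (collection name is a placeholder):
# db.measurements.aggregate(build_zero_bikes_pipeline(522, 3))
```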

How do you get directory listing sorted by date in Elixir?

How do you get directory listing sorted by date in Elixir?
File.ls/1 gives a list sorted by filename only.
No other function in the File module seems relevant for this.
Maybe there's a built-in function I don't know about, but you can make your own by using File.stat!/2:
File.ls!("path/to/dir")
|> Enum.map(&{&1, File.stat!(Path.join("path/to/dir", &1)).ctime})
|> Enum.sort(fn {_, time1}, {_, time2} -> time1 <= time2 end)
Example output:
[
{"test", {{2019, 3, 9}, {23, 55, 48}}},
{"config", {{2019, 3, 9}, {23, 55, 48}}},
{"README.md", {{2019, 3, 9}, {23, 55, 48}}},
{"_build", {{2019, 3, 9}, {23, 59, 48}}},
{"test.xml", {{2019, 3, 23}, {22, 1, 28}}},
{"foo.ex", {{2019, 4, 20}, {4, 26, 5}}},
{"foo", {{2019, 4, 21}, {3, 59, 29}}},
{"mix.exs", {{2019, 7, 27}, {8, 45, 0}}},
{"mix.lock", {{2019, 7, 27}, {8, 45, 7}}},
{"deps", {{2019, 7, 27}, {8, 45, 7}}},
{"lib", {{2019, 7, 27}, {9, 5, 36}}}
]
You can also make use of System.cmd/3 to achieve this.
In particular, you want to use the "ls" command with the flag "-t", which will sort by modification date, and maybe "-l", which will provide extra information.
Therefore you can use it like this:
# To simply get the filenames sorted by modification date
System.cmd("ls", ["-t"])
# Or with extra info
System.cmd("ls", ["-lt"])
This will return a tuple containing a string with the results and a number with the exit status.
So, if you just call it like that, it will produce something like:
iex> System.cmd("ls", ["-t"])
{"test_file2.txt\ntest_file1.txt\n", 0}
Having this, you can do lots of things, even pattern match on the exit code to process the output accordingly:
case System.cmd("ls", ["-t"]) do
  {contents, 0} ->
    # You can for instance retrieve a list with the filenames
    String.split(contents, "\n")
  {_contents, exit_code} ->
    # Or provide an error message
    {:error, "Directory contents could not be read. Exit code: #{exit_code}"}
end
If you don't want to handle the exit code and just care about the contents, you can simply run:
System.cmd("ls", ["-t"]) |> elem(0) |> String.split("\n")
Notice that this will include an empty string at the end, because the output string ends with a newline "\n".
Edit:
As pointed out in a comment, this assumes you're in the directory you want to list. If that is not the case, you can specify the directory by adding the :cd option, like so:
System.cmd("ls", ["-lt"], cd: "path/to/dir")

How to trigger events by count in windowing for apache beam?

I'd appreciate some help with windowing in Apache Beam 2.13.0.
I am using Python 3.7.3:
[ywatanabe#fedora-30-00 bdd100k-to-es]$ python3 -V
Python 3.7.3
I want to do exactly what this example does, but against bounded data: group the events emitted by each trigger firing and pass them to the next transform.
8.4.1.2. Discarding mode
If our trigger is set to discarding mode, the trigger emits the following values on each firing:
First trigger firing: [5, 8, 3]
Second trigger firing: [15, 19, 23]
Third trigger firing: [9, 13, 10]
Referencing the example, I have written my code as below:
es = (gps | 'window:gps' >> WindowInto(
                FixedWindows(1 * 60),
                trigger=Repeatedly(
                    AfterAny(
                        AfterCount(1000000),
                        AfterProcessingTime(1 * 60)
                    )
                ),
                accumulation_mode=AccumulationMode.DISCARDING)
          | 'bulk:gps' >> beam.ParDo(BulkToESFn(esHost), tag_gps))
However, with the above code the trigger seems to fire almost every millisecond instead of every minute or every 1,000,000 events.
2019-07-15 20:13:20,401 INFO Sending bulk request to elasticsearch. Doc counts: 11 Docs: {'track_id': '514df98862de83a07e7aff62dff77c3d', 'media_id': 'afe35b87-0a9acea6', 'ride_id': 'afe35b87d0b69e1928dd0a4fd75a1416', 'filename': '0a9acea6-62d6-4540-b048-41e34e2407c6.mov', 'timestamp': 1505287487.0, 'timezone': 'America/Los_Angeles', 'coordinates': {'lat': 37.786611081350365, 'lon': -122.3994713602353}, 'altitude': 16.06207275390625, 'vertical_accuracy': 4.0, 'horizantal_accuracy': 10.0, 'speed': 2.3399999141693115}
2019-07-15 20:13:20,403 INFO Sending bulk request to elasticsearch. Doc counts: 11 Docs: {'track_id': '514df98862de83a07e7aff62dff77c3d', 'media_id': 'afe35b87-0a9acea6', 'ride_id': 'afe35b87d0b69e1928dd0a4fd75a1416', 'filename': '0a9acea6-62d6-4540-b048-41e34e2407c6.mov', 'timestamp': 1505287488.0, 'timezone': 'America/Los_Angeles', 'coordinates': {'lat': 37.78659459994027, 'lon': -122.39945105706596}, 'altitude': 15.888671875, 'vertical_accuracy': 4.0, 'horizantal_accuracy': 10.0, 'speed': 2.3299999237060547}
2019-07-15 20:13:20,406 INFO Sending bulk request to elasticsearch. Doc counts: 11 Docs: {'track_id': '514df98862de83a07e7aff62dff77c3d', 'media_id': 'afe35b87-0a9acea6', 'ride_id': 'afe35b87d0b69e1928dd0a4fd75a1416', 'filename': '0a9acea6-62d6-4540-b048-41e34e2407c6.mov', 'timestamp': 1505287489.0, 'timezone': 'America/Los_Angeles', 'coordinates': {'lat': 37.78657796009011, 'lon': -122.39943055871701}, 'altitude': 15.741912841796875, 'vertical_accuracy': 4.0, 'horizantal_accuracy': 10.0, 'speed': 2.549999952316284}
Do I need any other option for windowing for this case ?
I think the window strategy and trigger strategy take effect at the GroupByKey (GBK) step.
https://beam.apache.org/documentation/programming-guide/#windowing
In your case, I think you can implement the DoFn (BulkToESFn) so that it buffers data and only writes to ES when the count exceeds a predefined value:
from apache_beam import DoFn

class BulkToESFn(DoFn):
    def __init__(self, batch_size=1000000):
        self.batch_size = batch_size
        self.batch = []

    def start_bundle(self):
        # Start every bundle with an empty buffer.
        self.batch = []

    def process(self, element, *args, **kwargs):
        self.batch.append(element)
        if len(self.batch) >= self.batch_size:
            self._flush()

    def finish_bundle(self):
        # Flush whatever remains when the bundle ends.
        self._flush()

    def _flush(self):
        if self.batch:
            writeToES(self.batch)  # your bulk-indexing helper
            self.batch = []

Determining whitespace in Go

From the documentation of Go's unicode package:
func IsSpace(r rune) bool
IsSpace reports whether the rune is a space character as defined by Unicode's White Space property; in the Latin-1 space this is
'\t', '\n', '\v', '\f', '\r', ' ', U+0085 (NEL), U+00A0 (NBSP).
Other definitions of spacing characters are set by category Z and property Pattern_White_Space.
My question is: What does it mean that "other definitions" are set by the Z category and Pattern_White_Space? Does this mean that calling unicode.IsSpace(), checking whether a character is in the Z category, and checking whether a character is in Pattern_White_Space will all yield different results? If so, what are the differences? And why are there differences?
The IsSpace function first checks whether your rune is in the Latin-1 char space. If it is, it uses the space characters you listed to determine white-spacing.
If not, isExcludingLatin (http://golang.org/src/unicode/letter.go?h=isExcludingLatin#L170) is called, which looks like:
func isExcludingLatin(rangeTab *RangeTable, r rune) bool {
    r16 := rangeTab.R16
    if off := rangeTab.LatinOffset; len(r16) > off && r <= rune(r16[len(r16)-1].Hi) {
        return is16(r16[off:], uint16(r))
    }
    r32 := rangeTab.R32
    if len(r32) > 0 && r >= rune(r32[0].Lo) {
        return is32(r32, uint32(r))
    }
    return false
}
The *RangeTable being passed in is White_Space, which is defined here:
http://golang.org/src/unicode/tables.go?h=White_Space#L6069
var _White_Space = &RangeTable{
    R16: []Range16{
        {0x0009, 0x000d, 1},
        {0x0020, 0x0020, 1},
        {0x0085, 0x0085, 1},
        {0x00a0, 0x00a0, 1},
        {0x1680, 0x1680, 1},
        {0x2000, 0x200a, 1},
        {0x2028, 0x2029, 1},
        {0x202f, 0x202f, 1},
        {0x205f, 0x205f, 1},
        {0x3000, 0x3000, 1},
    },
    LatinOffset: 4,
}
To answer your main question, the IsSpace check is not limited to Latin-1.
EDIT
For clarification: if the character you are testing is not in the Latin-1 charset, the range-table lookup is used. The Range16 values in the table represent ranges of 16-bit numbers, {Lo, Hi, Stride}. isExcludingLatin calls is16 with the sub-section of the table (R16) after the LatinOffset index (4 in this case) and determines whether the rune falls in any of those ranges.
So, that is checking these ranges:
{0x1680, 0x1680, 1},
{0x2000, 0x200a, 1},
{0x2028, 0x2029, 1},
{0x202f, 0x202f, 1},
{0x205f, 0x205f, 1},
{0x3000, 0x3000, 1},
Those ranges correspond to these Unicode code points:
http://www.fileformat.info/info/unicode/char/1680/index.htm
http://www.fileformat.info/info/unicode/char/2000/index.htm
http://www.fileformat.info/info/unicode/char/2001/index.htm
http://www.fileformat.info/info/unicode/char/2002/index.htm
http://www.fileformat.info/info/unicode/char/2003/index.htm
http://www.fileformat.info/info/unicode/char/2004/index.htm
http://www.fileformat.info/info/unicode/char/2005/index.htm
http://www.fileformat.info/info/unicode/char/2006/index.htm
http://www.fileformat.info/info/unicode/char/2007/index.htm
http://www.fileformat.info/info/unicode/char/2008/index.htm
http://www.fileformat.info/info/unicode/char/2009/index.htm
http://www.fileformat.info/info/unicode/char/200a/index.htm
http://www.fileformat.info/info/unicode/char/2028/index.htm
http://www.fileformat.info/info/unicode/char/2029/index.htm
http://www.fileformat.info/info/unicode/char/202f/index.htm
http://www.fileformat.info/info/unicode/char/205f/index.htm
http://www.fileformat.info/info/unicode/char/3000/index.htm
All of the above are considered "white space".
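To make the differences the question asks about concrete, here is a quick check using Python's unicodedata module (used purely as a convenient Unicode database; the facts apply equally to Go's tables):

```python
import unicodedata

# U+0085 (NEL) has the White_Space property, so Go's unicode.IsSpace
# reports true for it, yet its general category is Cc (control), not Z.
assert unicodedata.category("\u0085") == "Cc"

# U+00A0 (NBSP) is White_Space and category Zs, but it is not in
# Pattern_White_Space, which is a small fixed list (tab, LF, CR, NEL, ...).
assert unicodedata.category("\u00a0") == "Zs"

# So White_Space (what IsSpace tests), category Z, and Pattern_White_Space
# are three overlapping but genuinely distinct sets.
```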

mongodb: issues using $lte and $gte

look at this bizarre result:
list(db.users.find({"produit_up.spec.prix":{"$gte":0, "$lte": 1000}}, {"_id":0,"produit_up":1}))
Out[5]:
[{u'produit_up': [{u'avatar': {u'avctype': u'image/jpeg',
u'orientation': u'portrait',
u'photo': ObjectId('506867863a5f3a0ea84dcd6c')},
u'spec': {u'abus': 0,
u'date': u'2012-09-30',
u'description': u"portable tr\xe8s solide, peu servi, avec batterie d'une autonomie de 3 heures.",
u'id': u'alucaard134901952647',
u'namep': u'nokia 3310',
u'nombre': 1,
u'prix': 1000,
u'tags': [u'portable', u'nokia', u'3310'],
u'vendu': False}},
{u'avatar': {u'avctype': u'image/jpeg',
u'orientation': u'portrait',
u'photo': ObjectId('50686d013a5f3a04a8923b3e')},
u'spec': {u'abus': 0,
u'date': u'2012-09-30',
u'description': u'\u0646\u0628\u064a\u0639 \u0623\u064a \u0641\u0648\u0646 \u062c\u062f\u064a\u062f \u0641\u064a \u0627\u0644\u0628\u0648\u0627\u0637 \u0645\u0639\u0627\u0647 \u0634\u0627\u0631\u062c\u0648\u0631 \u062f\u0648\u0631\u064a\u062c \u064a\u0646',
u'id': u'alucaard134902092967',
u'namep': u'iphone 3gs',
u'nombre': 1,
u'prix': 20000,
u'tags': [u'iphone', u'3gs', u'apple'],
u'vendu': False}},
{u'avatar': {u'avctype': u'image/jpeg',
u'orientation': u'paysage',
u'photo': ObjectId('50686d3e3a5f3a04a8923b40')},
u'spec': {u'abus': 0,
u'date': u'2012-09-30',
u'description': u'vends 206 toutes options 2006 hdi.',
u'id': u'alucaard134902099082',
u'namep': u'peugeot 206',
u'nombre': 1,
u'prix': 500000,
u'tags': [u'voiture', u'206', u'hdi'],
u'vendu': False}}]}]
list(db.users.find({"produit_up.spec.prix":{"$gte":0, "$lte": 100}}, {"_id":0,"produit_up":1}))
Out[6]: []
pymongo.version
Out[8]: '2.3+'
and it gives me the same result in Mongo Shell:
db.version()
2.2.0
Here is the answer from Bernie Hackett:

You have three values for "produit_up.spec.prix": 1000, 20000, 500000. Why would you think that {"$gte": 0, "$lte": 100} would match any of those values? 100 is less than all of them.

The reason that {"$gte": 0, "$lte": 1000} returns all three subdocuments is that they are all in an array. Since one of the subdocuments in the array is matched, the entire enclosing document is a match for your query. And since you projected only "produit_up", just that array (including all of its members) is returned. Use $elemMatch (MongoDB 2.2+) to return only the exact matching array element.

MongoDB and PyMongo are working as designed here.
To get the behavior I think you're asking for see the $elemMatch operator:
http://docs.mongodb.org/manual/reference/projection/elemMatch/
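As a sketch of what that looks like from pymongo (collection and field names follow the question; this only builds the filter and projection, since the actual behaviour depends on running against a MongoDB 2.2+ server):

```python
def price_query(lo, hi):
    """Filter + projection that return only the array element whose
    spec.prix falls in [lo, hi], instead of the whole produit_up array."""
    cond = {"spec.prix": {"$gte": lo, "$lte": hi}}
    filt = {"produit_up": {"$elemMatch": cond}}
    proj = {"_id": 0, "produit_up": {"$elemMatch": cond}}
    return filt, proj

filt, proj = price_query(0, 1000)
# db.users.find(filt, proj) should then return only the first matching
# subdocument (here the one with prix = 1000), not all three array members.
```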