1/3 OSDs down in Ceph cluster after 95% of the storage is consumed - ceph

I'm new to Ceph, so I may be missing something obvious. I deployed a Ceph cluster using cephadm. In my first attempt I gave each node 3 GB of RAM (after some time I figured out that it needs more), and the cluster hung when one node's RAM and swap filled up to 100%. Now I give each node 8 GB of RAM and 10 GB of SSD swap, and that problem is fixed:
Node01:
4x CPU, 8 GB RAM, 60 GB SSD
Node02:
4x CPU, 6 GB RAM, 60 GB SSD
Node04:
4x CPU, 8 GB RAM, 60 GB HDD
I started using the cluster by creating a CephFS (this creates two pools, one for data and one for metadata, both with the 3x replica rule). I mount this filesystem on an Ubuntu 20.04 machine using ceph-common:
>> vim /etc/fstab
...
ceph-n01,ceph-n02:/ /ceph ceph _netdev,name=amin,secretfile=/home/amin/.ceph 0 0
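(For reference, the equivalent one-off mount should be roughly the following; the monitor names, mount point and client name are simply the ones from the fstab line above:)
>> sudo mount -t ceph ceph-n01,ceph-n02:/ /ceph -o name=amin,secretfile=/home/amin/.ceph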
It works fine. I use this filesystem with a service that renders a map and saves the tiles into the filesystem (my CephFS data pool). It ran for about a day and a half and generated ~56.65 GB of files. On the second day I saw that one OSD (the one on the HDD) was down and only two OSDs were running.
I checked the RAM and CPU status of the three nodes. On two nodes about 50% of the RAM was used, and on one node (node 01) 85% of the RAM was used plus ~4 GB of swap. I tried to fix the issue by restarting the OSDs: the OSD that was down kept crashing every time I restarted it, while the OSDs that had been running came back up successfully after a restart.
I looked at OSD logs:
debug -11> 2022-01-06T12:43:01.620+0000 7feaa5390080 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1641472981624364, "cf_name": "default", "job": 1, "event": "table_file_creation", "file_number": 3219, "file_size": 6014, "table_properties": {"data_size": 4653, "index_size": 64, "index_partitions": 0, "top_level_index_size": 0, "index_key_is_user_key": 0, "index_value_is_delta_encoded": 0, "filter_size": 453, "raw_key_size": 2663, "raw_average_key_size": 17, "raw_value_size": 2408, "raw_average_value_size": 16, "num_data_blocks": 2, "num_entries": 148, "num_deletions": 0, "num_merge_operands": 147, "num_range_deletions": 0, "format_version": 0, "fixed_key_len": 0, "filter_policy": "rocksdb.BuiltinBloomFilter", "column_family_name": "default", "column_family_id": 0, "comparator": "leveldb.BytewiseComparator", "merge_operator": ".T:int64_array.b:bitwise_xor", "prefix_extractor_name": "nullptr", "property_collectors": "[]", "compression": "NoCompression", "compression_options": "window_bits=-14; level=32767; strategy=0; max_dict_bytes=0; zstd_max_train_bytes=0; enabled=0; ", "creation_time": 1641472981, "oldest_key_time": 3, "file_creation_time": 0}}
debug -10> 2022-01-06T12:43:01.652+0000 7feaa5390080 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1641472981656222, "cf_name": "p-0", "job": 1, "event": "table_file_creation", "file_number": 3220, "file_size": 5104328, "table_properties": {"data_size": 4982595, "index_size": 70910, "index_partitions": 0, "top_level_index_size": 0, "index_key_is_user_key": 0, "index_value_is_delta_encoded": 0, "filter_size": 49989, "raw_key_size": 973446, "raw_average_key_size": 48, "raw_value_size": 4492366, "raw_average_value_size": 224, "num_data_blocks": 1298, "num_entries": 19980, "num_deletions": 10845, "num_merge_operands": 0, "num_range_deletions": 0, "format_version": 0, "fixed_key_len": 0, "filter_policy": "rocksdb.BuiltinBloomFilter", "column_family_name": "p-0", "column_family_id": 4, "comparator": "leveldb.BytewiseComparator", "merge_operator": "nullptr", "prefix_extractor_name": "nullptr", "property_collectors": "[]", "compression": "NoCompression", "compression_options": "window_bits=-14; level=32767; strategy=0; max_dict_bytes=0; zstd_max_train_bytes=0; enabled=0; ", "creation_time": 1641472981, "oldest_key_time": 3, "file_creation_time": 0}}
debug -9> 2022-01-06T12:43:01.688+0000 7feaa5390080 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1641472981692873, "cf_name": "p-2", "job": 1, "event": "table_file_creation", "file_number": 3221, "file_size": 5840600, "table_properties": {"data_size": 5701923, "index_size": 81198, "index_partitions": 0, "top_level_index_size": 0, "index_key_is_user_key": 0, "index_value_is_delta_encoded": 0, "filter_size": 56645, "raw_key_size": 1103994, "raw_average_key_size": 48, "raw_value_size": 5146222, "raw_average_value_size": 227, "num_data_blocks": 1485, "num_entries": 22623, "num_deletions": 12166, "num_merge_operands": 0, "num_range_deletions": 0, "format_version": 0, "fixed_key_len": 0, "filter_policy": "rocksdb.BuiltinBloomFilter", "column_family_name": "p-2", "column_family_id": 6, "comparator": "leveldb.BytewiseComparator", "merge_operator": "nullptr", "prefix_extractor_name": "nullptr", "property_collectors": "[]", "compression": "NoCompression", "compression_options": "window_bits=-14; level=32767; strategy=0; max_dict_bytes=0; zstd_max_train_bytes=0; enabled=0; ", "creation_time": 1641472981, "oldest_key_time": 3, "file_creation_time": 0}}
debug -8> 2022-01-06T12:43:01.688+0000 7feaa5390080 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1641472981694121, "cf_name": "O-0", "job": 1, "event": "table_file_creation", "file_number": 3222, "file_size": 73885, "table_properties": {"data_size": 72021, "index_size": 588, "index_partitions": 0, "top_level_index_size": 0, "index_key_is_user_key": 0, "index_value_is_delta_encoded": 0, "filter_size": 453, "raw_key_size": 9444, "raw_average_key_size": 60, "raw_value_size": 63028, "raw_average_value_size": 406, "num_data_blocks": 18, "num_entries": 155, "num_deletions": 0, "num_merge_operands": 0, "num_range_deletions": 0, "format_version": 0, "fixed_key_len": 0, "filter_policy": "rocksdb.BuiltinBloomFilter", "column_family_name": "O-0", "column_family_id": 7, "comparator": "leveldb.BytewiseComparator", "merge_operator": "nullptr", "prefix_extractor_name": "nullptr", "property_collectors": "[]", "compression": "NoCompression", "compression_options": "window_bits=-14; level=32767; strategy=0; max_dict_bytes=0; zstd_max_train_bytes=0; enabled=0; ", "creation_time": 1641472981, "oldest_key_time": 3, "file_creation_time": 0}}
debug -7> 2022-01-06T12:43:01.688+0000 7feaa5390080 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1641472981695243, "cf_name": "O-1", "job": 1, "event": "table_file_creation", "file_number": 3223, "file_size": 71023, "table_properties": {"data_size": 69158, "index_size": 589, "index_partitions": 0, "top_level_index_size": 0, "index_key_is_user_key": 0, "index_value_is_delta_encoded": 0, "filter_size": 453, "raw_key_size": 9028, "raw_average_key_size": 61, "raw_value_size": 60508, "raw_average_value_size": 408, "num_data_blocks": 18, "num_entries": 148, "num_deletions": 0, "num_merge_operands": 0, "num_range_deletions": 0, "format_version": 0, "fixed_key_len": 0, "filter_policy": "rocksdb.BuiltinBloomFilter", "column_family_name": "O-1", "column_family_id": 8, "comparator": "leveldb.BytewiseComparator", "merge_operator": "nullptr", "prefix_extractor_name": "nullptr", "property_collectors": "[]", "compression": "NoCompression", "compression_options": "window_bits=-14; level=32767; strategy=0; max_dict_bytes=0; zstd_max_train_bytes=0; enabled=0; ", "creation_time": 1641472981, "oldest_key_time": 3, "file_creation_time": 0}}
debug -6> 2022-01-06T12:43:01.692+0000 7feaa5390080 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1641472981696397, "cf_name": "O-2", "job": 1, "event": "table_file_creation", "file_number": 3224, "file_size": 75263, "table_properties": {"data_size": 73370, "index_size": 617, "index_partitions": 0, "top_level_index_size": 0, "index_key_is_user_key": 0, "index_value_is_delta_encoded": 0, "filter_size": 453, "raw_key_size": 9679, "raw_average_key_size": 60, "raw_value_size": 64238, "raw_average_value_size": 404, "num_data_blocks": 19, "num_entries": 159, "num_deletions": 0, "num_merge_operands": 0, "num_range_deletions": 0, "format_version": 0, "fixed_key_len": 0, "filter_policy": "rocksdb.BuiltinBloomFilter", "column_family_name": "O-2", "column_family_id": 9, "comparator": "leveldb.BytewiseComparator", "merge_operator": "nullptr", "prefix_extractor_name": "nullptr", "property_collectors": "[]", "compression": "NoCompression", "compression_options": "window_bits=-14; level=32767; strategy=0; max_dict_bytes=0; zstd_max_train_bytes=0; enabled=0; ", "creation_time": 1641472981, "oldest_key_time": 3, "file_creation_time": 0}}
debug -5> 2022-01-06T12:43:01.696+0000 7feaa5390080 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1641472981700113, "cf_name": "L", "job": 1, "event": "table_file_creation", "file_number": 3225, "file_size": 338198, "table_properties": {"data_size": 335953, "index_size": 1100, "index_partitions": 0, "top_level_index_size": 0, "index_key_is_user_key": 0, "index_value_is_delta_encoded": 0, "filter_size": 325, "raw_key_size": 1712, "raw_average_key_size": 16, "raw_value_size": 333803, "raw_average_value_size": 3119, "num_data_blocks": 39, "num_entries": 107, "num_deletions": 68, "num_merge_operands": 0, "num_range_deletions": 0, "format_version": 0, "fixed_key_len": 0, "filter_policy": "rocksdb.BuiltinBloomFilter", "column_family_name": "L", "column_family_id": 10, "comparator": "leveldb.BytewiseComparator", "merge_operator": "nullptr", "prefix_extractor_name": "nullptr", "property_collectors": "[]", "compression": "NoCompression", "compression_options": "window_bits=-14; level=32767; strategy=0; max_dict_bytes=0; zstd_max_train_bytes=0; enabled=0; ", "creation_time": 1641472981, "oldest_key_time": 3, "file_creation_time": 0}}
debug -4> 2022-01-06T12:43:01.784+0000 7feaa5390080 1 bluefs _allocate unable to allocate 0x80000 on bdev 1, allocator name block, allocator type bitmap, capacity 0xeffc00000, block size 0x1000, free 0x10c7c2000, fragmentation 1, allocated 0x0
debug -3> 2022-01-06T12:43:01.784+0000 7feaa5390080 -1 bluefs _allocate allocation failed, needed 0x7b2e8
debug -2> 2022-01-06T12:43:01.784+0000 7feaa5390080 -1 bluefs _flush_range allocated: 0xc90000 offset: 0xc8a944 length: 0x809a4
debug -1> 2022-01-06T12:43:01.792+0000 7feaa5390080 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' thread 7feaa5390080 time 2022-01-06T12:43:01.789216+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/os/bluestore/BlueFS.cc: 2768: ceph_abort_msg("bluefs enospc")
ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)
1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe5) [0x558699f6ac8c]
2: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned long)+0x1131) [0x55869a661901]
3: (BlueFS::_flush(BlueFS::FileWriter*, bool, bool*)+0x90) [0x55869a661be0]
4: (BlueFS::_flush(BlueFS::FileWriter*, bool, std::unique_lock<std::mutex>&)+0x32) [0x55869a672cf2]
5: (BlueRocksWritableFile::Append(rocksdb::Slice const&)+0x11b) [0x55869a68b32b]
6: (rocksdb::LegacyWritableFileWrapper::Append(rocksdb::Slice const&, rocksdb::IOOptions const&, rocksdb::IODebugContext*)+0x1f) [0x55869ab1dacf]
7: (rocksdb::WritableFileWriter::WriteBuffered(char const*, unsigned long)+0x58a) [0x55869ac2f81a]
8: (rocksdb::WritableFileWriter::Append(rocksdb::Slice const&)+0x2d0) [0x55869ac30c70]
9: (rocksdb::BlockBasedTableBuilder::WriteRawBlock(rocksdb::Slice const&, rocksdb::CompressionType, rocksdb::BlockHandle*, bool)+0xb6) [0x55869ad4c416]
10: (rocksdb::BlockBasedTableBuilder::WriteBlock(rocksdb::Slice const&, rocksdb::BlockHandle*, bool)+0x26c) [0x55869ad4cd5c]
11: (rocksdb::BlockBasedTableBuilder::WriteBlock(rocksdb::BlockBuilder*, rocksdb::BlockHandle*, bool)+0x3c) [0x55869ad4d47c]
12: (rocksdb::BlockBasedTableBuilder::Flush()+0x6d) [0x55869ad4d50d]
13: (rocksdb::BlockBasedTableBuilder::Add(rocksdb::Slice const&, rocksdb::Slice const&)+0x2b8) [0x55869ad50978]
14: (rocksdb::BuildTable(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::Env*, rocksdb::FileSystem*, rocksdb::ImmutableCFOptions const&, rocksdb::MutableCFOptions const&, rocksdb::FileOptions const&, rocksdb::TableCache*, rocksdb::InternalIteratorBase<rocksdb::Slice>*, std::vector<std::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator, std::default_delete<rocksdb::FragmentedRangeTombstoneIterator> >, std::allocator<std::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator, std::default_delete<rocksdb::FragmentedRangeTombstoneIterator> > > >, rocksdb::FileMetaData*, rocksdb::InternalKeyComparator const&, std::vector<std::unique_ptr<rocksdb::IntTblPropCollectorFactory, std::default_delete<rocksdb::IntTblPropCollectorFactory> >, std::allocator<std::unique_ptr<rocksdb::IntTblPropCollectorFactory, std::default_delete<rocksdb::IntTblPropCollectorFactory> > > > const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<unsigned long, std::allocator<unsigned long> >, unsigned long, rocksdb::SnapshotChecker*, rocksdb::CompressionType, unsigned long, rocksdb::CompressionOptions const&, bool, rocksdb::InternalStats*, rocksdb::TableFileCreationReason, rocksdb::EventLogger*, int, rocksdb::Env::IOPriority, rocksdb::TableProperties*, int, unsigned long, unsigned long, rocksdb::Env::WriteLifeTimeHint, unsigned long)+0xa45) [0x55869acfb3d5]
15: (rocksdb::DBImpl::WriteLevel0TableForRecovery(int, rocksdb::ColumnFamilyData*, rocksdb::MemTable*, rocksdb::VersionEdit*)+0xcf5) [0x55869ab60415]
16: (rocksdb::DBImpl::RecoverLogFiles(std::vector<unsigned long, std::allocator<unsigned long> > const&, unsigned long*, bool, bool*)+0x1c2e) [0x55869ab62b4e]
17: (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool, unsigned long*)+0xae8) [0x55869ab63ea8]
18: (rocksdb::DBImpl::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**, bool, bool)+0x59d) [0x55869ab5dbcd]
19: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**)+0x15) [0x55869ab5ef65]
20: (RocksDBStore::do_open(std::ostream&, bool, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x10c1) [0x55869aad6ec1]
21: (BlueStore::_open_db(bool, bool, bool)+0x948) [0x55869a55c4d8]
22: (BlueStore::_open_db_and_around(bool, bool)+0x2f7) [0x55869a5c6657]
23: (BlueStore::_mount()+0x204) [0x55869a5c9514]
24: (OSD::init()+0x380) [0x55869a09ea10]
25: main()
26: __libc_start_main()
27: _start()
debug 0> 2022-01-06T12:43:01.796+0000 7feaa5390080 -1 *** Caught signal (Aborted) **
in thread 7feaa5390080 thread_name:ceph-osd
The above is the log from the OSD that is down.
Reading through it, I found a line that looked useful to search for on Google:
bluefs _allocate unable to allocate 0x80000 on bdev 1, allocator name block, allocator type bitmap, capacity 0xeffc00000, block size 0x1000, free 0x10c7c2000, fragmentation 1, allocated 0x0
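If I convert the numbers in that line (my arithmetic, so please double-check): capacity 0xeffc00000 ≈ 60 GiB, free 0x10c7c2000 ≈ 4.2 GiB, block size 0x1000 = 4 KiB, and the allocation that failed, 0x80000, is 512 KiB. So the device still reports about 4 GiB free, but with the fragmentation score reported as 1 (which I read as fully fragmented) BlueFS apparently cannot find a usable extent for its own writes.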
I found a bug report that was related to version 16.2.1 (I am running 16.2.6):
https://tracker.ceph.com/issues/50656
Following it, I wanted to get an allocator dump from my OSD (I don't completely understand what it reports):
ceph daemon osd.1 bluestore allocator dump block
Can't get admin socket path: unable to get conf option admin_socket for osd: b"error parsing 'osd': expected string of the form TYPE.ID, valid types are: auth, mon, osd, mds, mgr, client\n"
I deployed the cluster with cephadm, which runs everything in containers, so I think I cannot access the admin socket this way. That led me to ceph-bluestore-tool to inspect the state of the physical disk (check the capacity, run fsck or repair), but ceph-bluestore-tool needs the --path of the OSD's data directory, which I cannot provide from the host, and my OSD container keeps crashing so I cannot run the command inside it either. I tried to run the command in the OSD container through cephadm but could not find a way to do it.
If you need the full log, tell me (I could not post it here due to the character limit), but it is just the same crash log repeating.
I don't really understand what's going on.
I tried to use ceph-volume to mount the block device on the host so that I could point ceph-bluestore-tool at it and run fsck or repair (it needs a --path argument pointing to the OSD files). I don't even know whether ceph-volume is meant to be used this way; as I said, I am new to Ceph.
I tried to use cephadm to run ceph-bluestore-tool commands against the crashed OSD, but I couldn't (I got the socket error I mentioned above).
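For clarity, this is roughly the kind of invocation I was hoping to be able to run (the OSD id and path here are only illustrative, since I never got this far):
>> cephadm shell --name osd.2
and then, inside that shell:
>> ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-2
>> ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-2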
My SSD OSDs were filled up to 94%, so the other OSD should still have free space on it (I guess).
The only lead I could find on the internet did not work for me.
I'm really desperate to find an answer. I would be happy for any help, even a pointer to a document to read or something I should go and learn.
Some information about my cluster is posted below:
[The Ceph Dashboard][1]
>> ceph -s
cluster:
id: 1ad06d18-3e72-11ec-8684-fd37cdad1703
health: HEALTH_WARN
mons ceph-n01,ceph-n02,ceph-n04 are low on available space
2 backfillfull osd(s)
Degraded data redundancy: 4282822/12848466 objects degraded (33.333%), 64 pgs degraded, 96 pgs undersized
3 pool(s) backfillfull
6 daemons have recently crashed
services:
mon: 3 daemons, quorum ceph-n01,ceph-n04,ceph-n02 (age 7h)
mgr: ceph-n02.xyrntr(active, since 4w), standbys: ceph-n04.srrvqt
mds: 1/1 daemons up, 1 standby
osd: 3 osds: 2 up (since 6h), 2 in (since 6h)
data:
volumes: 1/1 healthy
pools: 3 pools, 96 pgs
objects: 4.28M objects, 42 GiB
usage: 113 GiB used, 6.7 GiB / 120 GiB avail
pgs: 4282822/12848466 objects degraded (33.333%)
64 active+undersized+degraded
32 active+undersized
>> ceph orch ls
NAME PORTS RUNNING REFRESHED AGE PLACEMENT
alertmanager ?:9093,9094 1/1 4m ago 8w count:1
crash 3/3 6m ago 8w *
grafana ?:3000 1/1 4m ago 8w count:1
mds.cephfs 2/2 6m ago 7w label:mds
mgr 2/2 6m ago 8w label:mgr
mon 3/5 6m ago 3w count:5
node-exporter ?:9100 3/3 6m ago 8w *
osd 3/3 6m ago - <unmanaged>
prometheus ?:9095 1/1 4m ago 8w count:1
>> ceph orch host ls
HOST ADDR LABELS STATUS
ceph-n01 192.168.2.20 _admin mon mds
ceph-n02 192.168.2.21 mon mgr
ceph-n04 192.168.2.23 _admin mon mds mgr
>> ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.17578 root default
-3 0.05859 host ceph-n01
1 ssd 0.05859 osd.1 up 1.00000 1.00000
-5 0.05859 host ceph-n02
0 ssd 0.05859 osd.0 up 1.00000 1.00000
-10 0.05859 host ceph-n04
2 hdd 0.05859 osd.2 down 0 1.00000
>> ceph osd df
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
1 ssd 0.05859 1.00000 60 GiB 57 GiB 50 GiB 2.1 GiB 4.6 GiB 3.4 GiB 94.37 1.00 96 up
0 ssd 0.05859 1.00000 60 GiB 57 GiB 50 GiB 2.1 GiB 4.6 GiB 3.4 GiB 94.41 1.00 96 up
2 hdd 0.05859 0 0 B 0 B 0 B 0 B 0 B 0 B 0 0 0 down
TOTAL 120 GiB 113 GiB 100 GiB 4.3 GiB 9.1 GiB 6.7 GiB 94.39
MIN/MAX VAR: 1.00/1.00 STDDEV: 0.02
node01: >> free -m
total used free shared buff/cache available
Mem: 7956 7121 175 0 659 573
Swap: 10238 3748 6490
node02: >> free -m
total used free shared buff/cache available
Mem: 7956 7121 175 0 659 573
Swap: 10238 3748 6490
node04: >> free -m
total used free shared buff/cache available
Mem: 7922 2260 3970 1 1690 5371
Swap: 10238 642 9596

Related

Problem combining vehicle capacities and time windows using or-tools

I am trying to solve a VRP problem. I am following this file, which runs smooth as silk.
The problem is that when I modify it to fit my problem, it does not seem to work correctly. I don't know where the difference may lie.
Some details:
15 vehicles.
7 bases (vehicles depart from these bases).
10 demands locations (17 nodes in total).
Costs for all vehicles are the same.
Demands are all constant and equal to 1.
The duration of the service varies depending on the location.
The time window available for each location is the same (8 am to 5 pm). So there is no tight schedule, as long as the places are visited within the work shift.
The problem:
When I set the capacity of my vehicle to a low value (say, 2), then several vehicles go out to solve some task (with a maximum of 2 per vehicle, of course). This works fine, and the output is satisfactory. However, this is not the scenario I am looking for. Actually, my vehicles have the ability to perform as many tasks as they can, as long as the total time required for those tasks does not exceed the number of daily working hours (8am-5pm). However, if I set my capabilities to 100 (let's say infinity for this case), the solution indicates that only one vehicle comes out and it does all the tasks, even though all the tasks add up to 31 hours!
Questions:
What parameters am I using wrong?
Why is start_fn not used anywhere? Does it have anything to do with this?
Could it be related to the first_solution_strategy? This shouldn't be the cause, right? The contradiction is a vehicle working 31 hours when the maximum period set is 9 hours.
I don't understand why, when we call routing.AddDimension(..., 'Time'), we pass customers.time_horizon twice (I annotate my reading of that call just below this list).
Why is the duration of the route equal to zero in all cases? This can't be right.
I am lost with this. If someone can help me, I will be very grateful.
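Here is my annotated reading of that AddDimension call (based on the OR-tools docs as I understand them), which is where customers.time_horizon appears twice:
routing.AddDimension(
    tot_time_fn_index,       # index of the registered total-time callback
    customers.time_horizon,  # slack_max: maximum waiting time allowed at each node
    customers.time_horizon,  # capacity: upper bound on the cumulative time along a route
    True,                    # fix_start_cumul_to_zero
    'Time')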
So, here is my code:
import numpy as np
from collections import namedtuple
from ortools.constraint_solver import pywrapcp
from ortools.constraint_solver import routing_enums_pb2
from datetime import timedelta

class Ordenes():
    def __init__(self,
                 lats_in,
                 lons_in,
                 num_bases,
                 distmap,
                 ot_time,
                 num_stops,
                 demands,
                 time_horizon = 24 * 60 * 60):  # A 24 hour period.
        self.number = num_stops  #: The number of customers and depots
        self.distmap = distmap  #: the distance matrix
        stops = np.array(range(0, num_stops))  # The 'name' of the stop, indexed from 0 to num_stops-1
        self.num_bases = num_bases
        self.time_horizon = time_horizon
        lats, lons = lats_in, lons_in
        # timedeltas: Demands locations are available from 8am to 6pm.
        stime, etime = int(8 * 60 * 60), int(18 * 60 * 60)
        start_times = [timedelta(seconds = stime) for idx in range(self.number)]
        stop_times = [timedelta(seconds = etime) for idx in range(self.number)]
        # A named tuple for the Orden
        Orden = namedtuple(
            'Orden',
            [
                'index',    # the index of the stop
                'demand',   # the demand for the stop
                'lat',      # the latitude of the stop
                'lon',      # the longitude of the stop
                'tw_open',  # timedelta window open
                'tw_close'  # timedelta window close
            ])
        ziped_ = zip(stops, demands, lats, lons, start_times, stop_times)
        self.ordenes = [
            Orden(idx, dem, lat, lon, tw_open, tw_close)
            for idx, dem, lat, lon, tw_open, tw_close in ziped_]
        # The number of seconds needed to 'unload' 1 unit of goods is variable.
        self.service_time_per_dem = ot_time  # seconds

    def set_manager(self, manager):
        self.manager = manager

    def central_start_node(self):
        return range(self.num_bases)

    def make_distance_mat(self):
        return (self.distmap)

    def get_total_demand(self):
        return (sum([c.demand for c in self.ordenes]))

    def return_dist_callback(self, **kwargs):
        self.make_distance_mat()
        def dist_return(from_index, to_index):
            # Convert from routing variable Index to distance matrix NodeIndex.
            from_node = self.manager.IndexToNode(from_index)
            to_node = self.manager.IndexToNode(to_index)
            return (self.distmap[from_node][to_node])
        return dist_return

    def return_dem_callback(self):
        def dem_return(from_index):
            # Convert from routing variable Index to distance matrix NodeIndex.
            from_node = self.manager.IndexToNode(from_index)
            return (self.ordenes[from_node].demand)
        return dem_return

    def zero_depot_demands(self, depot):
        start_depot = self.ordenes[depot]
        self.ordenes[depot] = start_depot._replace(
            demand = 0,
            tw_open = None,
            tw_close = None)
        return

    def make_service_time_call_callback(self):
        def service_time_return(a, b):
            return (self.service_time_per_dem[a])
        return service_time_return

    def make_transit_time_callback(self, speed_kmph=30):
        def transit_time_return(a, b):
            return (self.distmap[a][b] / (speed_kmph * 1 / 3600))
        return transit_time_return
class Cuadrillas():
    def __init__(self,
                 capacities,
                 costs,
                 start_nodes):
        Cuadrilla = namedtuple(
            'Cuadrilla',
            ['index', 'capacity', 'cost'])
        self.start_nodes = start_nodes
        self.number = np.size(capacities)
        idxs = np.array(range(0, self.number))
        zip_in_ = zip(idxs, capacities, costs)
        self.cuadrillas = [Cuadrilla(idx, cap, cost) for idx, cap, cost in zip_in_]

    def get_total_capacity(self):
        return (sum([c.capacity for c in self.cuadrillas]))

    def return_starting_callback(self, ordenes):
        # create a different starting and finishing depot for each vehicle
        self.starts = self.start_nodes
        self.ends = self.starts
        # the depots will not have demands, so zero them.
        for depot in self.starts:
            ordenes.zero_depot_demands(depot)
        for depot in self.ends:
            ordenes.zero_depot_demands(depot)
        def start_return(v):
            return (self.starts[v])
        return start_return
def vehicle_output_string(manager, routing, plan):
    """
    Return a string displaying the output of the routing instance and
    assignment (plan).
    Args: routing (ortools.constraint_solver.pywrapcp.RoutingModel): routing.
    plan (ortools.constraint_solver.pywrapcp.Assignment): the assignment.
    Returns:
    (string) plan_output: describing each vehicle's plan.
    (List) dropped: list of dropped orders.
    """
    dropped = []
    for order in range(routing.Size()):
        if (plan.Value(routing.NextVar(order)) == order):
            dropped.append(str(order))
    capacity_dimension = routing.GetDimensionOrDie('Capacity')
    time_dimension = routing.GetDimensionOrDie('Time')
    plan_output = ''
    for route_number in range(routing.vehicles()):
        order = routing.Start(route_number)
        plan_output += 'Route {0}:'.format(route_number)
        if routing.IsEnd(plan.Value(routing.NextVar(order))):
            plan_output += ' Empty \n'
        else:
            while True:
                load_var = capacity_dimension.CumulVar(order)
                time_var = time_dimension.CumulVar(order)
                node = manager.IndexToNode(order)
                plan_output += \
                    ' {node} Load({load}) Time({tmin}, {tmax}) -> '.format(
                        node=node,
                        load=plan.Value(load_var),
                        tmin=str(timedelta(seconds=plan.Min(time_var))),
                        tmax=str(timedelta(seconds=plan.Max(time_var))))
                if routing.IsEnd(order):
                    plan_output += ' EndRoute {0}. \n'.format(route_number)
                    break
                order = plan.Value(routing.NextVar(order))
        plan_output += '\n'
    return (plan_output, dropped)
# [START solution_printer]
def print_solution(manager, routing, assignment):
    """Prints solution on console."""
    print(f'Objective: {assignment.ObjectiveValue()}')
    # Display dropped nodes.
    dropped_nodes = 'Dropped nodes:'
    for index in range(routing.Size()):
        if routing.IsStart(index) or routing.IsEnd(index):
            continue
        if assignment.Value(routing.NextVar(index)) == index:
            node = manager.IndexToNode(index)
            if node > 16:
                original = node
                while original > 16:
                    original = original - 16
                dropped_nodes += f' {node}({original})'
            else:
                dropped_nodes += f' {node}'
    print(dropped_nodes)
    # Display routes
    time_dimension = routing.GetDimensionOrDie('Time')
    total_time = 0
    for vehicle_id in range(manager.GetNumberOfVehicles()):
        plan_output = f'Route for vehicle {vehicle_id}:\n'
        index = routing.Start(vehicle_id)
        start_time = 0
        while not routing.IsEnd(index):
            time_var = time_dimension.CumulVar(index)
            node = manager.IndexToNode(index)
            if node > 16:
                original = node
                while original > 16:
                    original = original - 16
                plan_output += f'{node}({original})'
            else:
                plan_output += f'{node}'
            plan_output += f' Time:{assignment.Value(time_var)} -> '
            if start_time == 0:
                start_time = assignment.Value(time_var)
            index = assignment.Value(routing.NextVar(index))
        time_var = time_dimension.CumulVar(index)
        node = manager.IndexToNode(index)
        plan_output += f'{node} Time:{assignment.Value(time_var)}\n'
        end_time = assignment.Value(time_var)
        duration = end_time - start_time
        plan_output += f'Duration of the route:{duration}min\n'
        print(plan_output)
        total_time += duration
    print(f'Total duration of all routes: {total_time}min')
# [END solution_printer]
def main():
    # coordinates
    lats = [-45.80359358,-45.76451539,-45.80393496,-45.7719334,-45.76607548,
            -45.89857917,-45.70923876,-46.10321727,-45.81709206,-46.27827033,
            -45.67994619,-45.73426141,-45.89791315,-45.74206645,-46.226577,
            -46.08164013,-45.98688936]
    lons = [-68.20091669, -68.0438965, -68.67399508, -68.11662549, -68.17842196,
            -68.32238459, -68.23153574, -68.74653904, -68.7490935 , -68.88576051,
            -68.28244657, -68.29355024, -68.52404867, -68.92559956, -69.00577607,
            -68.51192289, -68.65117288]
    # Demand duration
    ot_time = [0, 0, 0, 0, 0, 0, 0, 5400, 5400, 43200, 2520, 2520, 5400, 2520, 12600, 2520, 2520]
    print(np.sum(ot_time)/3600)
    # Number of Stops:
    num_stops = 17
    # Number of bases: The first 7 nodes
    num_bases = 7
    # demands: Only one demand for each spot
    demands = np.ones(num_stops)
    # Distance matrix:
    distmat = [
        [ 0, 0, 22, 0, 0, 0, 0, 35, 35, 69, 1, 1, 25, 54, 61, 35, 35],
        [ 0, 0, 22, 0, 0, 0, 0, 35, 35, 69, 1, 1, 25, 54, 62, 35, 35],
        [22, 22, 0, 22, 22, 22, 22, 20, 21, 51, 20, 20, 13, 32, 46, 20, 21],
        [ 0, 0, 22, 0, 0, 0, 0, 35, 35, 69, 1, 1, 25, 54, 61, 35, 35],
        [ 0, 0, 22, 0, 0, 0, 0, 35, 35, 69, 1, 1, 25, 54, 61, 35, 35],
        [ 0, 0, 22, 0, 0, 0, 0, 35, 35, 69, 1, 1, 25, 54, 62, 35, 35],
        [ 0, 0, 22, 0, 0, 0, 0, 35, 35, 69, 1, 1, 25, 54, 62, 35, 35],
        [35, 35, 20, 35, 35, 35, 35, 0, 0, 33, 34, 34, 9, 27, 26, 0, 0],
        [35, 35, 21, 35, 35, 35, 35, 0, 0, 33, 34, 34, 9, 27, 26, 0, 0],
        [69, 69, 51, 69, 69, 69, 69, 33, 33, 0, 68, 68, 43, 31, 10, 33, 33],
        [ 1, 1, 20, 1, 1, 1, 1, 34, 34, 68, 0, 0, 25, 52, 61, 34, 34],
        [ 1, 1, 20, 1, 1, 1, 1, 34, 34, 68, 0, 0, 25, 52, 61, 34, 34],
        [25, 25, 13, 25, 25, 25, 25, 9, 9, 43, 25, 25, 0, 32, 36, 9, 9],
        [54, 54, 32, 54, 54, 54, 54, 27, 27, 31, 52, 52, 32, 0, 33, 27, 27],
        [61, 62, 46, 61, 61, 62, 62, 26, 26, 10, 61, 61, 36, 33, 0, 26, 26],
        [35, 35, 20, 35, 35, 35, 35, 0, 0, 33, 34, 34, 9, 27, 26, 0, 0],
        [35, 35, 21, 35, 35, 35, 35, 0, 0, 33, 34, 34, 9, 27, 26, 0, 0]]
    # Create a set of customer (and depot) stops.
    customers = Ordenes(
        lats_in = lats,
        lons_in = lons,
        num_bases = num_bases,
        distmap = distmat,
        ot_time = ot_time,
        demands = demands,
        num_stops = num_stops)
    # All vehicle capacities are the same ---> 15 vehicles
    veh_cap = 2  # 100 does not work, why?
    capacity = [int(x) for x in np.ones(15)*veh_cap]
    # Constant cost for all vehicles: 100
    cost = [int(x) for x in np.ones(15)*100]
    # Get the starting nodes of "cuadrillas"
    start_n = [
        0,  # vehicle 1 departs from and returns to base 0
        1,  # vehicle 2 departs from and returns to base 1
        2,  # vehicle 3 departs from and returns to base 2
        2,  # vehicle 4 departs from and returns to base 2
        2,  # vehicle 5 departs from and returns to base 2
        3,  # vehicle 6 departs from and returns to base 3
        3,  # vehicle 7 departs from and returns to base 3
        4,  # vehicle 8 departs from and returns to base 4
        4,  # vehicle 9 departs from and returns to base 4
        4,  # vehicle 10 departs from and returns to base 4
        5,  # vehicle 11 departs from and returns to base 5
        5,  # vehicle 12 departs from and returns to base 5
        6,  # vehicle 13 departs from and returns to base 6
        6,  # vehicle 14 departs from and returns to base 6
        6]  # vehicle 15 departs from and returns to base 6
    # Get penalties for dropping nodes
    penalty = [
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        8636,
        8636,
        8596,
        8571,
        8571,
        8556,
        8551,
        8495,
        8490,
        8490]
    # Create a set of cuadrillas, the number set by the length of capacity.
    vehicles = Cuadrillas(
        capacities = capacity,
        costs = cost,
        start_nodes = start_n)
    # Set the starting nodes, and create a callback fn for the starting node.
    start_fn = vehicles.return_starting_callback(customers)
    # Create the routing index manager.
    manager = pywrapcp.RoutingIndexManager(
        customers.number,  # int number
        vehicles.number,   # int number
        vehicles.starts,   # List of int start depot
        vehicles.ends)     # List of int end depot
    # Get customers a manager attribute
    customers.set_manager(manager)
    # Set model parameters
    model_parameters = pywrapcp.DefaultRoutingModelParameters()
    # Make the routing model instance.
    routing = pywrapcp.RoutingModel(manager, model_parameters)
    parameters = pywrapcp.DefaultRoutingSearchParameters()
    # Setting first solution heuristic (cheapest addition).
    parameters.first_solution_strategy = (
        routing_enums_pb2.FirstSolutionStrategy.PATH_CHEAPEST_ARC)
    # Routing: forbids use of TSPOpt neighborhood, (this is the default behaviour)
    parameters.local_search_operators.use_tsp_opt = pywrapcp.BOOL_FALSE
    # Disabling Large Neighborhood Search, (this is the default behaviour)
    parameters.local_search_operators.use_path_lns = pywrapcp.BOOL_FALSE
    parameters.local_search_operators.use_inactive_lns = pywrapcp.BOOL_FALSE
    parameters.time_limit.seconds = 20
    parameters.use_full_propagation = True
    #parameters.log_search = True
    # Create callback fns for distances, demands, service and transit-times.
    dist_fn = customers.return_dist_callback()
    dist_fn_index = routing.RegisterTransitCallback(dist_fn)
    dem_fn = customers.return_dem_callback()
    dem_fn_index = routing.RegisterUnaryTransitCallback(dem_fn)
    # Create and register a transit callback.
    serv_time_fn = customers.make_service_time_call_callback()
    transit_time_fn = customers.make_transit_time_callback()
    def tot_time_fn(from_index, to_index):
        """
        The time function we want is both transit time and service time.
        """
        # Convert from routing variable Index to distance matrix NodeIndex.
        from_node = manager.IndexToNode(from_index)
        to_node = manager.IndexToNode(to_index)
        return serv_time_fn(from_node, to_node) + transit_time_fn(from_node, to_node)
    tot_time_fn_index = routing.RegisterTransitCallback(tot_time_fn)
    # Set the cost function (distance callback) for each arc, homogeneous for
    # all vehicles.
    routing.SetArcCostEvaluatorOfAllVehicles(dist_fn_index)
    # Set vehicle costs for each vehicle, not homogeneous.
    for veh in vehicles.cuadrillas:
        routing.SetFixedCostOfVehicle(veh.cost, int(veh.index))
    # Add a dimension for vehicle capacities
    null_capacity_slack = 0
    routing.AddDimensionWithVehicleCapacity(
        dem_fn_index,  # demand callback
        null_capacity_slack,
        capacity,      # capacity array
        True,
        'Capacity')
    # Add a dimension for time and a limit on the total time_horizon
    routing.AddDimension(
        tot_time_fn_index,       # total time function callback
        customers.time_horizon,  # 28800, # 8am
        customers.time_horizon,  # 61200, # 5pm
        True,
        'Time')
    time_dimension = routing.GetDimensionOrDie('Time')
    for cust in customers.ordenes:
        if cust.tw_open is not None:
            time_dimension.CumulVar(
                manager.NodeToIndex(cust.index)).SetRange(
                    cust.tw_open.seconds,
                    cust.tw_close.seconds)
    """
    To allow the dropping of orders, we add disjunctions to all the customer
    nodes. Each disjunction is a list of 1 index, which allows that customer to
    be active or not, with a penalty if not. The penalty should be larger
    than the cost of servicing that customer, or it will always be dropped!
    """
    # To add disjunctions just to the customers, make a list of non-depots.
    non_depot = set(range(customers.number))
    non_depot.difference_update(vehicles.starts)  # removes the items that exist in both sets.
    non_depot.difference_update(vehicles.ends)    # removes the items that exist in both sets.
    nodes = [routing.AddDisjunction([manager.NodeToIndex(c)], penalty[c]) for c in non_depot]
    # Solve the problem !
    assignment = routing.SolveWithParameters(parameters)
    # The rest is all optional for saving, printing or plotting the solution.
    if assignment:
        print('The Objective Value is {0}'.format(assignment.ObjectiveValue()))
        plan_output, dropped = vehicle_output_string(manager, routing, assignment)
        print(plan_output)
        print('dropped nodes: ' + ', '.join(dropped))
        print("\n#####################")
        print("#####################")
        print_solution(manager, routing, assignment)
    else:
        print('No assignment')
    return  # main

if __name__ == '__main__':
    main()
The output using a capacity equal to 100:
Objective: 100
Dropped nodes:
Route for vehicle 0:
0 Time:0 -> 0 Time:0
Duration of the route:0min
Route for vehicle 1:
1 Time:0 -> 1 Time:0
Duration of the route:0min
Route for vehicle 2:
2 Time:0 -> 2 Time:0
Duration of the route:0min
Route for vehicle 3:
2 Time:0 -> 2 Time:0
Duration of the route:0min
Route for vehicle 4:
2 Time:0 -> 2 Time:0
Duration of the route:0min
Route for vehicle 5:
3 Time:0 -> 3 Time:0
Duration of the route:0min
Route for vehicle 6:
3 Time:0 -> 3 Time:0
Duration of the route:0min
Route for vehicle 7:
4 Time:0 -> 4 Time:0
Duration of the route:0min
Route for vehicle 8:
4 Time:0 -> 4 Time:0
Duration of the route:0min
Route for vehicle 9:
4 Time:0 -> 4 Time:0
Duration of the route:0min
Route for vehicle 10:
5 Time:0 -> 5 Time:0
Duration of the route:0min
Route for vehicle 11:
5 Time:0 -> 5 Time:0
Duration of the route:0min
Route for vehicle 12:
6 Time:0 -> 6 Time:0
Duration of the route:0min
Route for vehicle 13:
6 Time:0 -> 6 Time:0
Duration of the route:0min
Route for vehicle 14:
6 Time:0 -> 16 Time:0 -> 15 Time:28800 -> 14 Time:28800 -> 13 Time:28800 -> 12 Time:28800 -> 11 Time:28800 -> 10 Time:28800 -> 9 Time:28800 -> 8 Time:28800 -> 7 Time:28800 -> 6 Time:28800
Duration of the route:0min
Total duration of all routes: 0min
Solution using a capacity equal to 2:
Objective: 500
Dropped nodes:
Route for vehicle 0:
0 Time:0 -> 0 Time:0
Duration of the route:0min
Route for vehicle 1:
1 Time:0 -> 1 Time:0
Duration of the route:0min
Route for vehicle 2:
2 Time:0 -> 2 Time:0
Duration of the route:0min
Route for vehicle 3:
2 Time:0 -> 2 Time:0
Duration of the route:0min
Route for vehicle 4:
2 Time:0 -> 2 Time:0
Duration of the route:0min
Route for vehicle 5:
3 Time:0 -> 3 Time:0
Duration of the route:0min
Route for vehicle 6:
3 Time:0 -> 3 Time:0
Duration of the route:0min
Route for vehicle 7:
4 Time:0 -> 4 Time:0
Duration of the route:0min
Route for vehicle 8:
4 Time:0 -> 7 Time:28800 -> 4 Time:28800
Duration of the route:0min
Route for vehicle 9:
4 Time:0 -> 9 Time:28800 -> 8 Time:28800 -> 4 Time:28800
Duration of the route:0min
Route for vehicle 10:
5 Time:0 -> 5 Time:0
Duration of the route:0min
Route for vehicle 11:
5 Time:0 -> 11 Time:28800 -> 10 Time:28800 -> 5 Time:28800
Duration of the route:0min
Route for vehicle 12:
6 Time:0 -> 6 Time:0
Duration of the route:0min
Route for vehicle 13:
6 Time:0 -> 13 Time:28800 -> 12 Time:28800 -> 6 Time:28800
Duration of the route:0min
Route for vehicle 14:
6 Time:0 -> 16 Time:0 -> 15 Time:28800 -> 14 Time:28800 -> 6 Time:28800
Duration of the route:0min
Total duration of all routes: 0min

How to ensure that parquet files contain the row count in metadata?

Look at these sources: fast-parquet-row-count-in-spark and parquet-count-metadata-explanation.
Stack Overflow and the official Spark documentation tell us that a parquet file should contain the row count in its metadata, and Spark has added this by default since 1.6.
I tried to see this "field" but had no luck. Maybe I am doing something wrong? Could somebody tell me how to verify that a given parquet file has such a field? Any link to a small but good parquet file is welcome! For now I am invoking org.apache.parquet.tools.Main with the arguments meta D:\myparquet_file.parquet and I see no count keyword in the results.
You can inspect a parquet file using parquet-tools:
Install parquet-tools:
pip install parquet-tools
Create a parquet file. I used spark to create a small parquet file with 3 rows:
import spark.implicits._
val df: DataFrame = Seq((1, 2, 3), (4, 5, 6), (7, 8, 9)).toDF("col1", "col2", "col3")
df.coalesce(1).write.parquet("data/")
inspect the parquet file:
parquet-tools inspect /path/to/parquet/file
The output should be something like:
############ file meta data ############
created_by: parquet-mr version 1.10.1 (build a89df8f9932b6ef6633d06069e50c9b7970bebd1)
num_columns: 3
num_rows: 3
num_row_groups: 1
format_version: 1.0
serialized_size: 654
############ Columns ############
col1
col2
col3
############ Column(col1) ############
name: col1
path: col1
max_definition_level: 0
max_repetition_level: 0
physical_type: INT32
logical_type: None
converted_type (legacy): NONE
############ Column(col2) ############
name: col2
path: col2
max_definition_level: 0
max_repetition_level: 0
physical_type: INT32
logical_type: None
converted_type (legacy): NONE
############ Column(col3) ############
name: col3
path: col3
max_definition_level: 0
max_repetition_level: 0
physical_type: INT32
logical_type: None
converted_type (legacy): NONE
Under the file meta data section you can see the num_rows field, which represents the number of rows in the parquet file.
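If you prefer to check this programmatically rather than with the CLI, here is a minimal sketch using pyarrow (just an alternative to the above; the file path is a placeholder):

import pyarrow.parquet as pq

meta = pq.read_metadata("path/to/your.parquet")  # reads only the footer, not the data
print(meta.num_rows)        # total row count stored in the file metadata
print(meta.num_row_groups)  # number of row groups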
You can find the row count in the field RC just beside the row group.
row group 1: RC:148192 TS:10503944 OFFSET:4
Full output of parquet-tools with the meta option is below.
> parquet-tools meta part-00000-fc34f237-c985-4ebc-822b-87fa446f6f70.c000.snappy.parquet
file: file:/Users/matthewropp/team_demo/los-angeles-parking-citations/raw_citations/issue_month=201902/part-00000-fc34f237-c985-4ebc-822b-87fa446f6f70.c000.snappy.parquet
creator: parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a)
extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":":created_at","type":"string","nullable":true,"metadata":{}},{"name":":id","type":"string","nullable":true,"metadata":{}},{"name":":updated_at","type":"string","nullable":true,"metadata":{}},{"name":"agency","type":"integer","nullable":true,"metadata":{}},{"name":"body_style","type":"string","nullable":true,"metadata":{}},{"name":"color","type":"string","nullable":true,"metadata":{}},{"name":"fine_amount","type":"integer","nullable":true,"metadata":{}},{"name":"issue_date","type":"date","nullable":true,"metadata":{}},{"name":"issue_time","type":"integer","nullable":true,"metadata":{}},{"name":"latitude","type":"decimal(8,1)","nullable":true,"metadata":{}},{"name":"location","type":"string","nullable":true,"metadata":{}},{"name":"longitude","type":"decimal(8,1)","nullable":true,"metadata":{}},{"name":"make","type":"string","nullable":true,"metadata":{}},{"name":"marked_time","type":"string","nullable":true,"metadata":{}},{"name":"meter_id","type":"string","nullable":true,"metadata":{}},{"name":"plate_expiry_date","type":"date","nullable":true,"metadata":{}},{"name":"route","type":"string","nullable":true,"metadata":{}},{"name":"rp_state_plate","type":"string","nullable":true,"metadata":{}},{"name":"ticket_number","type":"string","nullable":false,"metadata":{}},{"name":"vin","type":"string","nullable":true,"metadata":{}},{"name":"violation_code","type":"string","nullable":true,"metadata":{}},{"name":"violation_description","type":"string","nullable":true,"metadata":{}}]}
file schema: spark_schema
--------------------------------------------------------------------------------
: created_at: OPTIONAL BINARY O:UTF8 R:0 D:1
: id: OPTIONAL BINARY O:UTF8 R:0 D:1
: updated_at: OPTIONAL BINARY O:UTF8 R:0 D:1
agency: OPTIONAL INT32 R:0 D:1
body_style: OPTIONAL BINARY O:UTF8 R:0 D:1
color: OPTIONAL BINARY O:UTF8 R:0 D:1
fine_amount: OPTIONAL INT32 R:0 D:1
issue_date: OPTIONAL INT32 O:DATE R:0 D:1
issue_time: OPTIONAL INT32 R:0 D:1
latitude: OPTIONAL INT32 O:DECIMAL R:0 D:1
location: OPTIONAL BINARY O:UTF8 R:0 D:1
longitude: OPTIONAL INT32 O:DECIMAL R:0 D:1
make: OPTIONAL BINARY O:UTF8 R:0 D:1
marked_time: OPTIONAL BINARY O:UTF8 R:0 D:1
meter_id: OPTIONAL BINARY O:UTF8 R:0 D:1
plate_expiry_date: OPTIONAL INT32 O:DATE R:0 D:1
route: OPTIONAL BINARY O:UTF8 R:0 D:1
rp_state_plate: OPTIONAL BINARY O:UTF8 R:0 D:1
ticket_number: REQUIRED BINARY O:UTF8 R:0 D:0
vin: OPTIONAL BINARY O:UTF8 R:0 D:1
violation_code: OPTIONAL BINARY O:UTF8 R:0 D:1
violation_description: OPTIONAL BINARY O:UTF8 R:0 D:1
row group 1: RC:148192 TS:10503944 OFFSET:4
--------------------------------------------------------------------------------
: created_at: BINARY SNAPPY DO:0 FPO:4 SZ:607/616/1.01 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 2019-02-28T00:16:06.329Z, max: 2019-03-02T00:20:00.249Z, num_nulls: 0]
: id: BINARY SNAPPY DO:0 FPO:611 SZ:2365472/3260525/1.38 VC:148192 ENC:BIT_PACKED,PLAIN,RLE ST:[min: row-2229_y75z.ftdu, max: row-zzzs_4hta.8fub, num_nulls: 0]
: updated_at: BINARY SNAPPY DO:0 FPO:2366083 SZ:602/611/1.01 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 2019-02-28T00:16:06.329Z, max: 2019-03-02T00:20:00.249Z, num_nulls: 0]
agency: INT32 SNAPPY DO:0 FPO:2366685 SZ:4871/5267/1.08 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 1, max: 58, num_nulls: 0]
body_style: BINARY SNAPPY DO:0 FPO:2371556 SZ:36244/61827/1.71 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: , max: WR, num_nulls: 0]
color: BINARY SNAPPY DO:0 FPO:2407800 SZ:111267/111708/1.00 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: , max: YL, num_nulls: 0]
fine_amount: INT32 SNAPPY DO:0 FPO:2519067 SZ:71989/82138/1.14 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 25, max: 363, num_nulls: 63]
issue_date: INT32 SNAPPY DO:0 FPO:2591056 SZ:20872/23185/1.11 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 2019-02-01, max: 2019-02-27, num_nulls: 0]
issue_time: INT32 SNAPPY DO:0 FPO:2611928 SZ:210026/210013/1.00 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 1, max: 2359, num_nulls: 41]
latitude: INT32 SNAPPY DO:0 FPO:2821954 SZ:508049/512228/1.01 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 99999.0, max: 6513161.2, num_nulls: 0]
location: BINARY SNAPPY DO:0 FPO:3330003 SZ:1251364/2693435/2.15 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,PLAIN,RLE ST:[min: , max: ZOMBAR/VALERIO, num_nulls: 0]
longitude: INT32 SNAPPY DO:0 FPO:4581367 SZ:516233/520692/1.01 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 99999.0, max: 1941557.4, num_nulls: 0]
make: BINARY SNAPPY DO:0 FPO:5097600 SZ:147034/150364/1.02 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: , max: YAMA, num_nulls: 0]
marked_time: BINARY SNAPPY DO:0 FPO:5244634 SZ:11675/17658/1.51 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: , max: 959.0, num_nulls: 0]
meter_id: BINARY SNAPPY DO:0 FPO:5256309 SZ:172432/256692/1.49 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: , max: YO97, num_nulls: 0]
plate_expiry_date: INT32 SNAPPY DO:0 FPO:5428741 SZ:149849/152288/1.02 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 2000-02-01, max: 2099-12-01, num_nulls: 18624]
route: BINARY SNAPPY DO:0 FPO:5578590 SZ:38377/45948/1.20 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: , max: WTD, num_nulls: 0]
rp_state_plate: BINARY SNAPPY DO:0 FPO:5616967 SZ:33281/60186/1.81 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: AB, max: XX, num_nulls: 0]
ticket_number: BINARY SNAPPY DO:0 FPO:5650248 SZ:801039/2074791/2.59 VC:148192 ENC:BIT_PACKED,PLAIN ST:[min: 1020798376, max: 4350802142, num_nulls: 0]
vin: BINARY SNAPPY DO:0 FPO:6451287 SZ:64/60/0.94 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: , max: , num_nulls: 0]
violation_code: BINARY SNAPPY DO:0 FPO:6451351 SZ:94784/131071/1.38 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 000, max: 8942, num_nulls: 0]
violation_description: BINARY SNAPPY DO:0 FPO:6546135 SZ:95937/132641/1.38 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: , max: YELLOW ZONE, num_nulls: 0]
> parquet-tools dump -m -c make part-00000-fc34f237-c985-4ebc-822b-87fa446f6f70.c000.snappy.parquet | head -20
BINARY make
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 148192 ***
value 1: R:0 D:1 V:HYDA
value 2: R:0 D:1 V:NISS
value 3: R:0 D:1 V:NISS
value 4: R:0 D:1 V:TOYO
value 5: R:0 D:1 V:AUDI
value 6: R:0 D:1 V:MERC
value 7: R:0 D:1 V:LEX
value 8: R:0 D:1 V:BMW
value 9: R:0 D:1 V:GMC
value 10: R:0 D:1 V:HOND
value 11: R:0 D:1 V:TOYO
value 12: R:0 D:1 V:NISS
value 13: R:0 D:1 V:
value 14: R:0 D:1 V:THOR
value 15: R:0 D:1 V:DODG
value 16: R:0 D:1 V:DODG
value 17: R:0 D:1 V:HOND

Losing records on Spark Kafka stream

I am losing records in my Kafka stream.
My Kafka stream runs on Spark infrastructure with this config:
val df = spark.
readStream.
format("kafka").
option("kafka.bootstrap.servers", broker_address).
option("subscribe", subject).
option("startingOffsets", "latest").
option("failOnDataLoss", "false").
load()
My sink is parquet files:
df_vertex.writeStream.
format("parquet").
option("checkpointLocation", "/tmp/vertex/check").
option("path", data_location).
option("mode", "append").
trigger(Trigger.ProcessingTime("10 seconds")).
start().
awaitTermination()
When I read my parquet files, some records are missing.
Those records are present in the Kafka broker but were not written to the sink.
21/12/04 05:32:50 DEBUG KafkaDataConsumer: Get spark-kafka-source-2edd8854-b64d-4f1a-b130-135e0c2b4c56--578079920-executor rainbow_data_extractor-0 nextOffset 42141 requested $offset
21/12/04 05:32:50 DEBUG RecordConsumerLoggingWrapper: <!-- start message -->
21/12/04 05:32:50 DEBUG MessageColumnIO: < MESSAGE START >
21/12/04 05:32:50 DEBUG MessageColumnIO: 0, VistedIndex{vistedIndexes={}}: [] r:0
21/12/04 05:32:50 DEBUG RecordConsumerLoggingWrapper: <id>
21/12/04 05:32:50 DEBUG MessageColumnIO: startField(id, 0)
21/12/04 05:32:50 DEBUG MessageColumnIO: 0, VistedIndex{vistedIndexes={}}: [id] r:0
21/12/04 05:32:50 DEBUG RecordConsumerLoggingWrapper: [99, 102, 49, 48, 49, 48, 99, 98, 100, 53, 53, 49, 48, 51, 49, 98, 55, 53, 51, 97, 100, 55, 101, 49, 100, 99, 50, 49, 101, 100, 100, 100, 101, 51, 53, 55, 101, 100, 100, 49, 98, 99, 98, 51, 53, 101, 55, 100, 99, 97, 49, 56, 102, 51, 100, 49, 49, 55, 54, 53, 98, 55, 48, 64]
21/12/04 05:32:50 DEBUG MessageColumnIO: addBinary(64 bytes)
21/12/04 05:32:50 DEBUG MessageColumnIO: r: 0
21/12/04 05:32:50 DEBUG MessageColumnIO: 0, VistedIndex{vistedIndexes={}}: [id] r:0
21/12/04 05:32:50 DEBUG RecordConsumerLoggingWrapper: </id>
21/12/04 05:32:50 DEBUG MessageColumnIO: endField(id, 0)
21/12/04 05:32:50 DEBUG MessageColumnIO: 0, VistedIndex{vistedIndexes={0}}: [] r:0
21/12/04 05:32:50 DEBUG RecordConsumerLoggingWrapper: <type>
21/12/04 05:32:50 DEBUG MessageColumnIO: startField(type, 1)
21/12/04 05:32:50 DEBUG MessageColumnIO: 0, VistedIndex{vistedIndexes={0}}: [type] r:0
21/12/04 05:32:50 DEBUG RecordConsumerLoggingWrapper: [117, 115, 101, 114]
21/12/04 05:32:50 DEBUG MessageColumnIO: addBinary(4 bytes)
21/12/04 05:32:50 DEBUG MessageColumnIO: r: 0
21/12/04 05:32:50 DEBUG MessageColumnIO: 0, VistedIndex{vistedIndexes={0}}: [type] r:0
21/12/04 05:32:50 DEBUG RecordConsumerLoggingWrapper: </type>
21/12/04 05:32:50 DEBUG MessageColumnIO: endField(type, 1)
21/12/04 05:32:50 DEBUG MessageColumnIO: 0, VistedIndex{vistedIndexes={0, 1}}: [] r:0
21/12/04 05:32:50 DEBUG MessageColumnIO: [created_by].writeNull(0,0)
21/12/04 05:32:50 DEBUG MessageColumnIO: [created_on].writeNull(0,0)
21/12/04 05:32:50 DEBUG MessageColumnIO: [last_message].writeNull(0,0)
21/12/04 05:32:50 DEBUG MessageColumnIO: [is_active].writeNull(0,0)
21/12/04 05:32:50 DEBUG MessageColumnIO: [is_archived].writeNull(0,0)
21/12/04 05:32:50 DEBUG MessageColumnIO: < MESSAGE END >
21/12/04 05:32:50 DEBUG MessageColumnIO: 0, VistedIndex{vistedIndexes={0, 1}}: [] r:0
21/12/04 05:32:50 DEBUG RecordConsumerLoggingWrapper: <!-- end message -->
21/12/04 05:32:50 DEBUG KafkaDataConsumer: Get spark-kafka-source-2edd8854-b64d-4f1a-b130-135e0c2b4c56--578079920-executor rainbow_data_extractor-0 nextOffset 42142 requested $offset
21/12/04 05:32:50 DEBUG KafkaDataConsumer: Get spark-kafka-source-2edd8854-b64d-4f1a-b130-135e0c2b4c56--578079920-executor rainbow_data_extractor-0 nextOffset 42143 requested $offset
21/12/04 05:32:50 DEBUG KafkaDataConsumer: Get spark-kafka-source-2edd8854-b64d-4f1a-b130-135e0c2b4c56--578079920-executor rainbow_data_extractor-0 nextOffset 42144 requested $offset
21/12/04 05:32:50 DEBUG KafkaDataConsumer: Get spark-kafka-source-2edd8854-b64d-4f1a-b130-135e0c2b4c56--578079920-executor rainbow_data_extractor-0 nextOffset 42145 requested $offset
21/12/04 05:32:50 DEBUG KafkaDataConsumer: Get spark-kafka-source-2edd8854-b64d-4f1a-b130-135e0c2b4c56--578079920-executor rainbow_data_extractor-0 nextOffset 42146 requested $offset
In this example I lose the records from 42142 to 42146.
All my records are put into the Kafka broker within a short time.
Perhaps I have some problem with the scheduling of the received records.
Does someone have the right stream configuration for my case?
Thanks
Seb

Caffe out of memory, where is it used?

I'm trying to train a network in Caffe, a slightly modified SegNet-basic model.
I understand that the Check failed: error == cudaSuccess (2 vs. 0) out of memory error I am getting is due to me running out of GPU memory. However, what puzzles me is this:
My "old" training attempts worked fine. The network initialized and ran, with the following:
batch size 4
Memory required for data: 1800929300 (this includes the batch size, so it is 4x the sample size here)
Total number of parameters: 1418176
the network is made out of 4x (convolution, ReLU, pooling) followed by 4x (upsample, deconvolution), with 64 filters of kernel size 7x7 per layer.
What surprises me is that my "new" network runs out of memory, and I don't understand what is reserving the additional memory, since I lowered the batch size:
batch size 1
Memory required for data: 1175184180 ( = sample size)
Total number of parameters: 1618944
The input size is doubled along each dimension (the expected output size does not change); the reason for the increased number of parameters is one additional set of (convolution, ReLU, pooling) at the beginning of the network.
The number of parameters was counted by this script, by summing up the layer-wise parameters, obtained by multiplying the number of dimensions in each layer.
Assuming that each parameter needs 4 bytes of memory, data_memory + num_param*4 still gives a higher memory requirement for my old setup, memory_old = 1806602004 B = 1.68 GB, than for the new one, memory_new = 1181659956 B = 1.10 GB.
I've accepted that the additional memory is probably needed somewhere, and that I'll have to re-think my new setup and downsample my input if I can't find a GPU with more memory; however, I am really trying to understand where the additional memory is needed and why my new setup is running out of memory.
EDIT: Per request, here are the layer dimensions for each of the networks coupled with the size of the data that passes through it:
"Old" network:
Top shape: 4 4 384 512 (3145728)
('conv1', (64, 4, 7, 7)) --> 4 64 384 512 (50331648)
('conv1_bn', (1, 64, 1, 1)) --> 4 64 384 512 (50331648)
('conv2', (64, 64, 7, 7)) --> 4 64 192 256 (12582912)
('conv2_bn', (1, 64, 1, 1)) --> 4 64 192 256 (12582912)
('conv3', (64, 64, 7, 7)) --> 4 64 96 128 (3145728)
('conv3_bn', (1, 64, 1, 1)) --> 4 64 96 128 (3145728)
('conv4', (64, 64, 7, 7)) --> 4 64 48 64 (786432)
('conv4_bn', (1, 64, 1, 1)) --> 4 64 48 64 (786432)
('conv_decode4', (64, 64, 7, 7)) --> 4 64 48 64 (786432)
('conv_decode4_bn', (1, 64, 1, 1)) --> 4 64 48 64 (786432)
('conv_decode3', (64, 64, 7, 7)) --> 4 64 96 128 (3145728)
('conv_decode3_bn', (1, 64, 1, 1)) --> 4 64 96 128 (3145728)
('conv_decode2', (64, 64, 7, 7)) --> 4 64 192 256 (12582912)
('conv_decode2_bn', (1, 64, 1, 1)) --> 4 64 192 256 (12582912)
('conv_decode1', (64, 64, 7, 7)) --> 4 64 384 512 (50331648)
('conv_decode1_bn', (1, 64, 1, 1)) --> 4 64 384 512 (50331648)
('conv_classifier', (3, 64, 1, 1))
For the "New" network, the top few layers differ and the rest is exactly the same except that the batch size is 1 instead of 4:
Top shape: 1 4 769 1025 (3152900)
('conv0', (64, 4, 7, 7)) --> 1 4 769 1025 (3152900)
('conv0_bn', (1, 64, 1, 1)) --> 1 64 769 1025 (50446400)
('conv1', (64, 4, 7, 7)) --> 1 64 384 512 (12582912)
('conv1_bn', (1, 64, 1, 1)) --> 1 64 384 512 (12582912)
('conv2', (64, 64, 7, 7)) --> 1 64 192 256 (3145728)
('conv2_bn', (1, 64, 1, 1)) --> 1 64 192 256 (3145728)
('conv3', (64, 64, 7, 7)) --> 1 64 96 128 (786432)
('conv3_bn', (1, 64, 1, 1)) --> 1 64 96 128 (786432)
('conv4', (64, 64, 7, 7)) --> 1 64 48 64 (196608)
('conv4_bn', (1, 64, 1, 1)) --> 1 64 48 64 (196608)
('conv_decode4', (64, 64, 7, 7)) --> 1 64 48 64 (196608)
('conv_decode4_bn', (1, 64, 1, 1)) --> 1 64 48 64 (196608)
('conv_decode3', (64, 64, 7, 7)) --> 1 64 96 128 (786432)
('conv_decode3_bn', (1, 64, 1, 1)) --> 1 64 96 128 (786432)
('conv_decode2', (64, 64, 7, 7)) --> 1 64 192 256 (3145728)
('conv_decode2_bn', (1, 64, 1, 1)) --> 1 64 192 256 (3145728)
('conv_decode1', (64, 64, 7, 7)) --> 1 64 384 512 (12582912)
('conv_decode1_bn', (1, 64, 1, 1)) --> 1 64 384 512 (12582912)
('conv_classifier', (3, 64, 1, 1))
This skips the pooling and upsampling layers. Here is the train.prototxt for the "new" network. The old network does not have the layers conv0, conv0_bn and pool0, while the other layers are the same. The "old" network also has batch_size set to 4 instead of 1.
EDIT2: Per request, even more info:
All the input data has the same dimensions. It's a stack of 4 channels, each of the size 769x1025, so always 4x769x1025 input.
The caffe training log is here: as you can see, I get out of memory just after network initialization. Not a single iteration runs.
My GPU has 8GB of memory, while I've just found out (trying it on a different machine) that this new network requires 9.5GB of GPU memory.
Just to re-iterate, I am trying to understand why my "old" setup fits into 8 GB of memory and the "new" one doesn't, as well as why the amount of memory needed for the additional data is ~8 times larger than the memory needed to hold the input. However, now that I have confirmed that the "new" setup takes only 9.5 GB, it might not be as much bigger than the "old" one as I suspected (unfortunately the GPU is currently being used by somebody else, so I can't check exactly how much memory the old setup needed).
Bear in mind that caffe actually allocates room for two copies of the net: the "train phase" net and the "test phase" net. So if the data takes 1.1GB you need to double this space.
Moreover, you need to allocate space for the parameters. Each parameter needs to store its gradient. In addition, the solver keeps track of the "momentum" for each parameter (sometimes even a 2nd moment, e.g., in the ADAM solver). Therefore, increasing the number of parameters even by a tiny amount can result in a significant addition to the memory footprint of the training system.
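To put very rough numbers on that (a back-of-the-envelope sketch under the assumptions above, not exact Caffe accounting): the parameters themselves stay small, roughly 1,618,944 x 4 B x 3 (value + gradient + momentum) ≈ 19 MB, so most of the footprint comes from the data blobs (~1.1 GB per pass in the new setup), which roughly doubles if the diffs needed for back-propagation are kept for each blob, and grows again with the second (test-phase) copy of the net.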

How does [hist] from SMLib work in Pure Data?

I put the following message into a [hist 0 100 10] object (in SMLib):
0 1 2 3 3 4 5 5 5 6 7 7 7 7 8 9 10 11 11 11 11 11 12 13 14 15 16 17 18 19 20 21 22 23 23 23 23 23 23 23 23 23 23 67 99 100 107
I then hit 'absolute' and the following is output.
6 19 18 0 0 0 0 1 0 3
I was expecting it to count the occurrences of the numbers into even bins of size 10 but only six numbers are in the first bin, and the 67 is in the wrong bin!
I counted up how it evaluated the bins and got the following:
[0, 1, 2, 3, 3, 4] = 6
[5, 5, 5, 6, 7, 7, 7, 7, 8, 9, 10, 11, 11, 11, 11, 11, 12, 13, 14] = 19
[15, 16, 17, 18, 19, 20, 21, 22, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23] = 18
[] = 0
[] = 0
[] = 0
[] = 0
[67] = 1
[] = 0
[99, 100, 107] = 3
But I was expecting the following result:
16 14 13 0 0 0 1 0 0 3
Fixed it!
I was using [hist 0 100 10] when I should have been using [hist 5 105 10]!
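In other words (my reading of the behaviour from the numbers above): [hist lo hi n] seems to centre its n bins on lo, lo+w, lo+2w, ... with w = (hi-lo)/n, so a value is assigned to the nearest bin centre rather than floored into [lo, lo+w). With [hist 0 100 10], 4 rounds to the bin centred on 0 but 5 rounds to the bin centred on 10, which is exactly the 6/19/18... split I got, and 99, 100 and 107 all end up clipped into the last bin. Shifting everything by half a bin width with [hist 5 105 10] makes the bins behave like the [0,10), [10,20), ... ranges I originally wanted.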