maxmind / maxmind-db

Spec and test data for the MaxMind DB file format

Home Page: https://maxmind.github.io/MaxMind-DB/

License: Apache License 2.0

Go 100.00%

geoip mmdb geoip2 maxmind

maxmind-db's Introduction

MaxMind DB is a binary file format that stores data indexed by IP address subnets (IPv4 or IPv6).

This repository contains the spec for that format as well as test databases.

Copyright and License

This is free software, licensed under the Apache License, Version 2.0 or the MIT License, at your option.

maxmind-db's Issues

Need clearer statement when NodeAddress equals NodeCount

"The pointer can point to a value equal to $number_of_nodes."

Does this mean that the Pointer has a Value equal to the NodeCount?

Or does this mean that the location pointed to by this pointer will have a value equal to the NodeCount?

I'm guessing it's the former but the wording sounds more like the latter.
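To make the first reading concrete, here is a minimal sketch in which the record value itself is compared with the node count. It assumes the three-way split described in the published spec (less than, equal to, or greater than the node count); the function name `classify_record` and the returned labels are made up for illustration.

```python
def classify_record(record_value: int, node_count: int) -> str:
    """Classify a search-tree record by comparing its own value to node_count."""
    if record_value < node_count:
        # The value is the number of another node in the search tree.
        return "node"
    if record_value == node_count:
        # Assumed meaning: there is no data for this network.
        return "no_data"
    # Anything larger resolves to an entry in the data section.
    return "data"
```

Under this reading the pointer is never dereferenced to find the node count; the comparison is made on the record value alone.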

Request: add a California-based ip to GeoIP2-City-Test.mmdb

The CCPA (California Consumer Privacy Act) means we must, in some cases, treat users in, or roaming into, California differently from other users. Having an IP address for this location in the City test database would be useful.

Error opening `GeoIP2-City-Test.mmdb` with maxmind.open

Hello,

This:

await maxmind.open<CityResponse>('./test-data/GeoIP2-City-Test.mmdb');

ends up in:

Error: Unknown type 17 at offset 1
        at Decoder.decodeByType (node_modules/mmdb-lib/src/decoder.ts:131:11)
        at Decoder.decode (node_modules/mmdb-lib/src/decoder.ts:78:17)
        at Object.<anonymous>.exports.parseMetadata (node_modules/mmdb-lib/src/metadata.ts:28:28)
        at Reader.load (node_modules/mmdb-lib/src/index.ts:25:21)
        at new Reader (node_modules/mmdb-lib/src/index.ts:20:10)
        at Object.open (node_modules/maxmind/src/index.ts:36:18)

"maxmind@npm:^4.3.2":
  version: 4.3.2
  resolution: "maxmind@npm:4.3.2"
  dependencies:
    mmdb-lib: 1.3.0

The CSV files are missing geo subjects returned by the MMDB

Problem

I've downloaded the latest updates of MaxMind's GeoLite2 City database (in both the MaxMind DB binary and CSV formats). When I looked up "88.184.98.0", here's what I got:

{"city":{"geoname_id":2982652,"names":{"de":"Rouen","en":"Rouen","es":"Ruan","fr":"Rouen","ja":"ルーアン","pt-BR":"Ruão","ru":"Руан","zh-CN":"鲁昂"}},"continent":{"code":"EU","geoname_id":6255148,"names":{"de":"Europa","en":"Europe","es":"Europa","fr":"Europe","ja":"ヨーロッパ","pt-BR":"Europa","ru":"Европа","zh-CN":"欧洲"}},"country":{"geoname_id":3017382,"is_in_european_union":true,"iso_code":"FR","names":{"de":"Frankreich","en":"France","es":"Francia","fr":"France","ja":"フランス共和国","pt-BR":"França","ru":"Франция","zh-CN":"法国"}},"location":{"accuracy_radius":5,"latitude":49.4431,"longitude":1.0993,"time_zone":"Europe/Paris"},"postal":{"code":"76100"},"registered_country":{"geoname_id":3017382,"is_in_european_union":true,"iso_code":"FR","names":{"de":"Frankreich","en":"France","es":"Francia","fr":"France","ja":"フランス共和国","pt-BR":"França","ru":"Франция","zh-CN":"法国"}},"subdivisions":[{"geoname_id":11071621,"iso_code":"NOR","names":{"de":"Normandie","en":"Normandy","es":"Normandía","fr":"Normandie"}},{"geoname_id":2975248,"iso_code":"76","names":{"de":"Seine-Maritime","en":"Seine-Maritime","es":"Sena Marítimo","fr":"Seine-Maritime","pt-BR":"Sena Marítimo"}}]}

However, there's no corresponding geoname_id for the returned subdivisions in the CSV files (e.g. cat GeoLite2-City-Locations-en.csv | fgrep 11071621 returns nothing). This situation is very common for subdivisions (e.g. Novosibirsk Oblast, Scotland, etc.). Is it a bug or expected behaviour? What is the relation between the CSV files and the MMDB format?

Why is it important

For services that target or filter traffic based on location, the geoname_id values are important. For example, if we have an ad serving network and want to allow users to restrict a particular ad to a set of geolocations, it makes sense to describe such a set using the respective geoname_id values from the CSV files and compare them against the geoname_id values returned by the MMDB format when deciding whether or not to serve the ad (depending on the location the request came from). However, if a geoname_id is absent from the CSV files, we can't restrict the ad to the respective location even though the MMDB format returns it when resolving an IP address.

Workaround

A workaround is to manually add the missing objects to the CSV files using the IDs returned by the MMDB format (although there are a lot of them to add manually). In that case, a very important question is whether these geoname_id values are reliable, or whether they are likely to change in the future.

Wrong calculation in spec example

http://maxmind.github.io/MaxMind-DB/
If a record contained the value 6,000, this formula would give us an offset of 4,084 into the data section.

Shouldn't the offset be:

(6000-1000)-16=4984?
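The arithmetic can be checked with a few lines of Python. This is only a sketch of the formula as quoted in the issue; the node count of 1,000 is taken from the calculation above, and the function name is illustrative.

```python
def data_section_offset(record_value: int, node_count: int) -> int:
    """Offset into the data section for a record value, per the quoted formula."""
    return (record_value - node_count) - 16


print(data_section_offset(6000, 1000))  # 4984
```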

Bad data that looks like it should be good data

Hi,

I've got this MMDB validation tool; I recently started running it over the test-data/MaxMind-*.mmdb test cases (as of commit 2bf1713) and encountered what looks like a few violations of the spec.

Under `test-data/MaxMind-DB-test-decoder.mmdb`

{map_key_of_wrong_type_in_data_section,
   #{key => {uint32, 1},        
     position => 120,  % index from the start of the data section
     under => #{path => [{115,{map_with,12,elements}}]}}}

There's a map with 12 elements at index 115, and one of its keys (at index 120) is a uint32 with value 1.

And under `test-data/MaxMind-DB-test-pointer-decoder.mmdb`

{map_key_of_wrong_type_in_data_section,
   #{key => {uint32, 1},
     position => 9, % index from the start of the data section
     under => #{path => [{0,{map_with,15,elements}}]}}},

There's a map with 15 elements at index 0, and one of its keys (at index 9) is a uint32 with value 1.

{map_key_of_wrong_type_in_data_section,
   #{key => {uint32, 7},
     position => 119,
     under =>
         #{path =>
               [{266,{map_with,12,elements}},
                {293,{pointer,108}},
                {108,{map_with,1,elements}},
                {114,{map_with,2,elements}}]}}}

There's a map with 12 elements at index 266. After chasing a pointer, one of its values turns out to be another map, at index 108, containing 1 element. That map in turn contains yet another map, at index 114, with two elements, and one of the keys of the latter is a uint32, at index 119, with value 7.

The names of the tests suggested that nothing should be broken (they contain no words like "bad" or "broken"), but maybe I just assumed too much there.

Therefore, my questions are:

Are they supposed to be broken,
and if they're not, is the presence of those uint32 map keys a bug?

At the moment I'm not capturing the tree prefixes which lead to the erroneous data pieces (for performance reasons), which is why I'm only providing you with indices, but please tell me if that's not practical.

MD5 files in "incorrect" format (GeoLite2-ASN-CSV.zip.md5)

Can the .md5 files please be generated in the "correct" format? Currently they look as follows
(as taken from https://geolite.maxmind.com/download/geoip/database/GeoLite2-ASN-CSV.zip.md5):

$hash

Instead, they should be in the format

$hash $filename

so that md5sum -c can be used. Currently you get the error md5sum: GeoLite2-ASN-CSV.zip.md5: no properly formatted MD5 checksum lines found
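Until the published files change, a hash-only file can be turned into a line that `md5sum -c` accepts. The sketch below assumes the usual GNU coreutils text-mode layout (hash, two spaces, file name); the helper name `to_md5sum_line` and the hash in the usage comment are illustrative only.

```python
def to_md5sum_line(hash_only_text: str, filename: str) -> str:
    """Build a '<hash>  <filename>' line from the contents of a hash-only .md5 file."""
    digest = hash_only_text.strip().lower()
    # An MD5 digest is 32 hexadecimal characters; reject anything else early.
    if len(digest) != 32 or any(c not in "0123456789abcdef" for c in digest):
        raise ValueError("not an MD5 hex digest: %r" % hash_only_text)
    return f"{digest}  {filename}\n"


# Example with a placeholder hash (the MD5 of empty input):
# to_md5sum_line("d41d8cd98f00b204e9800998ecf8427e\n", "GeoLite2-ASN-CSV.zip")
```

Writing the returned line to a file and passing that file to `md5sum -c` should then check the archive in the current directory.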

Typo in DB spec

In the MaxMind-DB-spec.md file, in the "Data Field Format" section, there is the following:

The first three bits of the control byte tell you what type the field is. If these bits are all 0, then this is an "extended" type, which means that the next byte contains the actual type. Otherwise, the first three bytes will contain a number from 1 to 7, the actual type for the field.

I believe the last sentence is meant to be:

Otherwise, the first three bits will contain a number from 1 to 7, the actual type for the field.
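The corrected sentence is easier to follow next to code. The sketch below reads the type from the top three bits of the control byte. One detail goes beyond the quoted paragraph and should be treated as an assumption taken from the published spec: for extended types, the type number is stored in the next byte as the actual type minus 7. The function name is illustrative.

```python
from typing import Tuple


def decode_type(buf: bytes, offset: int = 0) -> Tuple[int, int]:
    """Return (type_number, offset_after_type_bytes) for the field at offset."""
    control = buf[offset]
    offset += 1
    type_number = control >> 5  # the first three *bits*, not bytes
    if type_number == 0:
        # Extended type: assumed to be stored as (actual type - 7) in the next byte.
        type_number = 7 + buf[offset]
        offset += 1
    return type_number, offset
```

For example, a control byte of `0b01000101` has `010` in its top three bits, so the field has type 2.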

Please rename Perl 6 to Raku on dev.maxmind.com

Perl 6 was renamed to Raku in October 2019 and has been developed under its own brand since then.

I've noticed that on https://dev.maxmind.com/geoip/geoip2/downloadable/ the module https://github.com/bbkr/GeoIP2 is listed as Perl 6. Please change the description to Raku.

Thanks!

Custom Key Values

Can I put any custom MD5 string as the key and a custom JSON structure as the value in an MMDB file?
Or is there any other way to achieve this in the MMDB file format?

golang reader support

Hi:

Do you have plans to support a Go reader for MMDB?

IPv4 in 2001::/32 scope in GeoIP2-Country-Test.mmdb

Hi,

I've run some dumping tests with the GeoIP2-Country-Test.mmdb file. I was wondering why IPv4 addresses are present in the 2001::/32 scope.

Let's take 67.43.156.0 for example. The following hexadecimal values are matched when iterating over the binary tree:

0x432b9c00
0xffff432b9c00
0x20010000432b9c000000000000000000
0x2002432b9c0000000000000000000000

As far as I know, IPv4 can be wrapped in ::/96, ::ffff:0:0/96 and 2002::/16 IPv6 scopes but it should not be so in 2001::/32. I'm not much at ease with IPv6 so maybe I'm missing something.
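The four values can be reproduced by shifting the IPv4 address into different bit positions, which makes it easier to see which prefix each one falls under. This is a sketch using Python's `ipaddress` module; the variable names are illustrative. One possibly relevant fact: 2001::/32 is the prefix reserved for Teredo, which carries an IPv4 address in bits 32 to 63, so the third value may be an alias of that kind. Whether the test database was built that way is an assumption here.

```python
import ipaddress

v4 = int(ipaddress.IPv4Address("67.43.156.0"))

# The same 32 bits placed at four different positions in a 128-bit address.
plain = v4                                   # ::/96
mapped = (0xFFFF << 32) | v4                 # ::ffff:0:0/96
teredo_style = (0x2001 << 112) | (v4 << 64)  # 2001::/32, IPv4 in bits 32-63
six_to_four = (0x2002 << 112) | (v4 << 80)   # 2002::/16, IPv4 in bits 16-47

for value in (plain, mapped, teredo_style, six_to_four):
    print(hex(value), ipaddress.IPv6Address(value))
```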

Thanks

Maximum size of database metadata area is ambiguous

The spec says that the maximum size of the database metadata portion of the data file is "128kb." Is that kilobits or kilobytes, and is kilo measured as 10^3 or 2^10?

Does the production DB have some IPs which are guaranteed to return a particular value for different fields?

Hey! We want to check the geo data fields for some IPs in our production integration test. It will help us guard against failed automatic GeoIP database updates.

We are not interested in swapping out the production dbs. Is there a solution for us?

Node tree Zero or One based?

Please document whether the nodes in the IP search tree are zero-based or one-based.

Starting at the beginning of the file, do they go from 0 to NodeCount-1 or from 1 to NodeCount?
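The practical difference shows up when converting a node number to a byte offset. The sketch below assumes zero-based numbering, with node 0 at the very start of the file, which is how I read the published spec; under one-based numbering every offset would shift by one node. The function name is illustrative.

```python
def node_byte_offset(node_number: int, record_size: int) -> int:
    """Byte offset of a node from the start of the file, assuming nodes count from 0."""
    node_size_bytes = record_size * 2 // 8  # two records per node, size given in bits
    return node_number * node_size_bytes
```

With a 28-bit record size each node is 7 bytes, so under this assumption node 1 starts at byte 7 and the last node, NodeCount-1, ends exactly where the search tree ends.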

Common IPv4 and IPv6 addresses across IP-based databases

Thank you for providing the test databases. These make testing much easier. It would be very helpful, though, if there was at least one common IPv4 and IPv6 address across the IP-based databases. Otherwise the test databases can't readily be used in conjunction with one another. For example, testing a process that adds City and ASN information to metadata related to a given IP address, we can only do a portion of this at a time because there does not appear to be an IPv4 or IPv6 address that is common across any of the databases that we could use.

Thanks!

Please add "native" language to your databases.

GeoIP2 databases introduced translations, but the fact that they are forced is really annoying.

Let's say I have a bunch of users with recognized IPs and I want to find users from Gdańsk, Kraków and Warszawa. Those from Gdańsk have the English name "Gdańsk", which is kind of counterintuitive because it is not an English word, but OK. Those from Kraków have the English name "Krakow", which makes things harder because "Cracow" is an alternate spelling, so I have to know which one was chosen in the MaxMind database. And those from Warszawa have the English name translated to "Warsaw".

This makes creating any sane queries almost impossible. How the hell do I know if there is an English translation for Warząchewka or Szczebrzeszyn? And this goes both ways: I want to show a Polish user "Twój ostatni login był 2 dni temu z miasta Warszawa" (your last login was 2 days ago from Warsaw), not "Twój ostatni login był 2 dni temu z miasta Warsaw".

So my idea is to add a native name next to the returned languages, for example:

    'subdivisions' => [
        {
            'geoname_id' => 3337496,
            'name' => 'Pomorskie',      <==== like this
            'names' => {
                'ru' => 'Поморское воеводство',
                'es' => 'Pomerania',
                'ja' => 'ポモージェ県',
                'de' => 'Woiwodschaft Pommern',
                'fr' => 'Voïvodie de Poméranie',
                'en' => 'Pomerania'
            },
            'iso_code' => 'PM'
        }
    ]

Is it possible?

Thanks!

Source data doesn't match test data

In a specific case under GeoIP2-Static-IP-Score-Test.json, recently modified in order to use floating point values, the JSON value 0.3 is corrupted when generating MMDB test data.

The problem is, 0.3 has no precise representation in binary floating point - its closest representable value is 2.99999999999999988898e-01. But in this case, it's read as 0.30000000000000004 instead, which is the value to be found within GeoIP2-Static-IP-Score-Test.mmdb.

I boiled the issue down to what's (presumably) a bug in the Cpanel::JSON::XS library, which is used to decode the JSON sources.

Here's a small Perl script that reproduces the issue:

use Cpanel::JSON::XS qw( decode_json );

my $literal_float = 0.3;
my $json_object = decode_json('{"value": 0.3}');
my $json_float = $json_object->{value};

my $difference = $literal_float - $json_float;
print "$difference\n"

It outputs -5.55111512312578e-17. Mind. Blowing.
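The size of that difference is exactly one unit in the last place at 0.3, which suggests the decoder landed on the neighbouring double rather than the nearest one. The following Python sketch (not the Perl code path in question) shows the two candidates; the helper name `next_double_up` is made up for illustration.

```python
import struct
from decimal import Decimal


def next_double_up(x: float) -> float:
    """Return the next representable double above a positive x by bumping its bit pattern."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return struct.unpack(">d", struct.pack(">Q", bits + 1))[0]


nearest = 0.3                         # the double closest to the decimal 0.3
neighbour = next_double_up(nearest)   # one step above it

print(Decimal(nearest))      # exact decimal expansion of the nearest double
print(repr(neighbour))       # 0.30000000000000004
print(nearest - neighbour)   # -5.551115123125783e-17
```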

Clarify the maximum metadata section size.

"The maximum allowable size for the metadata section, including the marker that starts the metadata, is 128kb."

b = bits
B = Bytes

Is it really 128k bits (an odd way to express this) or do you actually mean 128kB (ie bytes)?

Conflicting description of calculating address into data region

The descriptions showing how to calculate addresses into the data section are in conflict and the example doesn't help (I think it's wrong).

$offset_in_file = ( $record_value - $node_count )
+ $search_tree_size_in_bytes + 16

I think the above is correct, but it conflicts with the description below and the example given.

$data_section_offset = ( $record_value - $node_count ) - 16

The -16 should either be deleted, or it should be +16 if referenced from the end of the search tree.
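One reading that makes the two formulas consistent is sketched below. It assumes that the 16 refers to the separator between the search tree and the data section, that `$data_section_offset` is measured from the first byte after that separator, and that the -16 form is the intended one. This is a reconciliation attempt under those assumptions, not a statement of what the spec authors meant.

```python
DATA_SECTION_SEPARATOR_SIZE = 16  # assumed: bytes between search tree and data section


def data_section_offset(record_value: int, node_count: int) -> int:
    """Offset relative to the start of the data section (the second formula)."""
    return (record_value - node_count) - DATA_SECTION_SEPARATOR_SIZE


def offset_in_file(record_value: int, node_count: int, search_tree_size_in_bytes: int) -> int:
    """Absolute file offset: where the data section starts, plus the relative offset."""
    data_section_start = search_tree_size_in_bytes + DATA_SECTION_SEPARATOR_SIZE
    return data_section_start + data_section_offset(record_value, node_count)
```

Under these assumptions the +16 and -16 cancel, so the file offset reduces to `record_value - node_count + search_tree_size_in_bytes`, which differs from the first formula as quoted by exactly 16 bytes.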

Which test database can be used to test bigger data sizes?

I'm talking about sizes described by the control byte and the byte(s) that follow it.

I haven't encountered any values that require $size >= 30 to decode.
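For reference, this is how I understand the size encoding in the published spec: the low five bits of the control byte hold the size directly up to 28, and the values 29, 30 and 31 mean that one, two or three extra bytes follow. A test value would need a payload of at least 285 bytes to reach the `30` branch. The sketch below is an illustration with a made-up function name, and it does not cover pointers, which use the size bits differently.

```python
from typing import Tuple


def decode_size(control_byte: int, buf: bytes, offset: int) -> Tuple[int, int]:
    """Return (payload_size, new_offset); offset points just past the type byte(s)."""
    size = control_byte & 0x1F  # low five bits
    if size < 29:
        return size, offset
    if size == 29:
        # One extra byte: sizes 29 .. 284
        return 29 + buf[offset], offset + 1
    if size == 30:
        # Two extra bytes: sizes 285 .. 65,820
        return 285 + int.from_bytes(buf[offset:offset + 2], "big"), offset + 2
    # size == 31, three extra bytes: sizes from 65,821 upwards
    return 65821 + int.from_bytes(buf[offset:offset + 3], "big"), offset + 3
```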

Recent tarballs unreadable by older tar implementations

(Not sure if this is the appropriate place to report this.)

The numeric uid/gid of the files contained in the latest GeoLite2 database tarballs have changed from 2000 to 1991400094[*].

This makes the tarballs incompatible with older implementations of tar due to the new very large identifier. Sadly, the implementation bundled with Erlang/OTP up to version 19 is one of those, and my .mmdb reader is now unable to load the latest tarballs on not-so-old Erlang versions.

GeoLite2-Country.tar.gz (May 14th)

$ tar tvfz GeoLite2-Country.tar.gz --numeric-owner   
drwxr-xr-x 2000/2000         0 2019-05-13 19:20 GeoLite2-Country_20190514/
-rw-r--r-- 2000/2000        55 2019-05-13 19:20 GeoLite2-Country_20190514/COPYRIGHT.txt
-rw-r--r-- 2000/2000       433 2019-05-13 19:20 GeoLite2-Country_20190514/LICENSE.txt
-rw-r--r-- 2000/2000   3798346 2019-05-13 19:20 GeoLite2-Country_20190514/GeoLite2-Country.mmdb

GeoLite2-Country.tar.gz (May 21st)

$ tar tvfz GeoLite2-Country.tar.gz --numeric-owner   
drwxr-xr-x 1991400094/1991400094       0 2019-05-20 21:12 GeoLite2-Country_20190521/
-rw-r--r-- 1991400094/1991400094     433 2019-05-20 21:12 GeoLite2-Country_20190521/LICENSE.txt
-rw-r--r-- 1991400094/1991400094      55 2019-05-20 21:12 GeoLite2-Country_20190521/COPYRIGHT.txt
-rw-r--r-- 1991400094/1991400094 3769909 2019-05-20 21:12 GeoLite2-Country_20190521/GeoLite2-Country.mmdb

[*]: Username and group name remain the same: tjmather

Test database doesn't correctly handle `is_in_european_union`

It appears the GeoIP2-City-Test.mmdb doesn't correctly handle is_in_european_union when performing a look up on a test IP address.

For example, 2a02:cfc0::29 (FR record used from GeoIP2-City-Test.json) yields the correct GeoIP2 data but IsInEuropeanUnion is false. This happens for, what appears to be, all European Union sample data. Please advise if I am missing something!

Source data

{
      "2a02:cfc0::/29" : {
         "continent" : {
            "code" : "EU",
            "geoname_id" : 6255148,
            "names" : {
               "de" : "Europa",
               "en" : "Europe",
               "es" : "Europa",
               "fr" : "Europe",
               "ja" : "ヨーロッパ",
               "pt-BR" : "Europa",
               "ru" : "Европа",
               "zh-CN" : "欧洲"
            }
         },
         "country" : {
            "geoname_id" : 3017382,
            "is_in_european_union" : true,
            "iso_code" : "FR",
            "names" : {
               "de" : "Frankreich",
               "en" : "France",
               "es" : "Francia",
               "fr" : "France",
               "ja" : "フランス共和国",
               "pt-BR" : "França",
               "ru" : "Франция",
               "zh-CN" : "法国"
            }
         },
         "location" : {
            "accuracy_radius" : 100,
            "latitude" : "46",
            "longitude" : "2",
            "time_zone" : "Europe/Paris"
         },
         "registered_country" : {
            "geoname_id" : 3017382,
            "is_in_european_union" : true,
            "iso_code" : "FR",
            "names" : {
               "de" : "Frankreich",
               "en" : "France",
               "es" : "Francia",
               "fr" : "France",
               "ja" : "フランス共和国",
               "pt-BR" : "França",
               "ru" : "Франция",
               "zh-CN" : "法国"
            }
         }
      }
   },

Yielded result

&{{0 map[]} {EU 6255148 map[ru:Европа zh-CN:欧洲 de:Europa en:Europe es:Europa fr:Europe ja:ヨーロッパ pt-BR:Europa]} {3017382 false FR map[pt-BR:França ru:Франция zh-CN:法国 de:Frankreich en:France es:Francia fr:France ja:フランス共和国]} {0 46 2 0 Europe/Paris} {} {3017382 false FR map[de:Frankreich en:France es:Francia fr:France ja:フランス共和国 pt-BR:França ru:Франция zh-CN:法国]} {0 false  map[] } [] {false false}}

iso-codes within database

Is there any reason to store iso-codes for different languages inside binary database?

Diagram for 28 bit (medium db) discrepancy

This doesn't seem like the right layout. They should be 2 big-endian unsigned ints, one after another, right?

https://github.com/maxmind/MaxMind-DB/blob/master/MaxMind-DB-spec.md#28-bits-medium-database-one-node-is-7-bytes

The first int is shown with its higher order bits out of place.
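For comparison, here is a sketch of how I understand existing readers to unpack a 7-byte node: the two records are not simply adjacent 28-bit integers, because the middle byte contributes its high nibble to the left record and its low nibble to the right record, in both cases as the most significant four bits. Treat the layout as an assumption to be checked against the spec diagram; the function name is illustrative.

```python
from typing import Tuple


def read_node_28(node: bytes) -> Tuple[int, int]:
    """Split a 7-byte node into its (left, right) 28-bit record values."""
    if len(node) != 7:
        raise ValueError("a 28-bit node is exactly 7 bytes")
    middle = node[3]
    # Left record: high nibble of the middle byte, then bytes 0-2.
    left = ((middle & 0xF0) << 20) | int.from_bytes(node[0:3], "big")
    # Right record: low nibble of the middle byte, then bytes 4-6.
    right = ((middle & 0x0F) << 24) | int.from_bytes(node[4:7], "big")
    return left, right
```

If this is the layout, it would explain why the diagram shows the left record's high-order bits after its low-order bytes.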

Logic behind metadata section

I'm having a hard time understanding the logic behind where the metadata is placed / how it is meant to be found. It seems that this information would belong in a header that can be efficiently read when opening the file. The spec pretty much forces you to search through the entire file at least once to find this section, even if you only want to lookup a single address. At first I figured this cannot be what was intended, but indeed it seems this is what all the implementations do in practice.

What am I missing? Why is this placed at the end of the file, without even an offset to find it?
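In practice the search does not have to cover the whole file if the size cap on the metadata section is honoured: only the tail needs scanning. The sketch below assumes the marker is the byte sequence `\xab\xcd\xef` followed by `MaxMind.com`, and it reads the spec's "128kb" as 128 KiB, which is itself an interpretation. The function name is illustrative.

```python
METADATA_MARKER = b"\xab\xcd\xefMaxMind.com"
MAX_METADATA_SIZE = 128 * 1024  # assumption: "128kb" means 128 KiB


def find_metadata_start(data: bytes) -> int:
    """Return the offset of the first metadata byte, scanning only the file's tail."""
    window_start = max(0, len(data) - MAX_METADATA_SIZE)
    # Take the last occurrence, in case the marker bytes also appear in the data section.
    index = data.rfind(METADATA_MARKER, window_start)
    if index == -1:
        raise ValueError("metadata marker not found")
    return index + len(METADATA_MARKER)
```

With a memory-mapped file this touches at most the last 128 KiB, although it still does not explain why the format has no fixed-position header.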

Position of IPv4 space in code and in documentation

Documentation:
The strategy that MaxMind uses for its GeoIP databases is to include a pointer from the ::ffff:0:0/96 subnet to the root node of the IPv4 address space in the tree.
The library code uses a loop with node = self._read_node(node, 0), which is effectively ::0000:0:0/96.

Until the recent GeoIP2 Connection Type database, looking up IPv4 space at ::ffff:0:0/96 worked fine, but now node 82 points to data instead of another node, so only ::0000:0:0/96 works now.

Can you comment on this? Is there anything I'm missing?
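The loop mentioned above can be sketched as follows: starting at the root, follow the left (zero) record 96 times, stopping early if a record no longer refers to a node. The `read_node(node_number, index)` callable stands in for a reader's own method and is an assumption, as is the function name.

```python
from typing import Callable


def find_ipv4_start(read_node: Callable[[int, int], int], node_count: int) -> int:
    """Walk 96 zero bits from the root, i.e. descend into ::/96."""
    node = 0
    for _ in range(96):
        if node >= node_count:
            # The record points at data (or at nothing), not at another node.
            break
        node = read_node(node, 0)  # index 0 selects the left record
    return node
```

A lookup that instead descends through ::ffff:0:0/96 follows 80 zero bits and then 16 one bits, so it only reaches the same subtree if the database contains the alias described in the documentation.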

Signed 32-bit integer documentation is unclear

Currently the docs state, "When storing a signed integer, the left-most bit is the sign. A 1 is negative and a 0 is positive." However, I believe this is only true if the size is 4 bytes, at least going by all of our implementations. For example, if the stored size of the int32 is 1 byte, we treat 10000001 as 129 rather than -127.
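The behaviour described here amounts to zero-padding the stored bytes on the left to four bytes and only then interpreting the result as a signed big-endian integer. A minimal sketch, with an illustrative function name:

```python
def decode_int32(payload: bytes) -> int:
    """Decode an int32 field stored in 0 to 4 bytes, as the readers described above do."""
    if len(payload) > 4:
        raise ValueError("int32 payload is at most 4 bytes")
    # Left-pad with zero bytes, so a short payload can never have the sign bit set.
    padded = payload.rjust(4, b"\x00")
    return int.from_bytes(padded, "big", signed=True)


print(decode_int32(b"\x81"))              # 129
print(decode_int32(b"\xff\xff\xff\x81"))  # -127
```

Under this scheme a negative value always occupies the full four bytes, which is why the sign-bit sentence in the docs only holds for 4-byte payloads.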

Gibberish text in ASN database

Hi,

I'm the maintainer of locus, an .mmdb reader library for Erlang. I've been working on a new analysis feature that looks for database corruption / incompatibilities (or bugs in my code instead.)

When I fed January 15th's GeoLite2 ASN database to this new analyzer, a single data record was rejected:

* {data_record_decoding_failed,
      #{class => error,data_index => 1984015,
        reason =>
            {not_utf8_printable_text,
                <<67,104,105,110,97,32,116,101,108,101,99,111,109,32,195,131,
                  194,162,195,130,194,128,195,130,194,147,32,67,104,105,110,
                  97,32,78,101,120,116,32,71,101,110,101,114,97,116,105,111,
                  110,32,73,110,116,101,114,110,101,116>>},
        tree_prefix => {{8193,3176,0,0,0,0,0,0},32}}}

The tree prefix is "2001:c68::/32"
The data record index is 1984015 (from the beginning of the data section)
The string failing validation is "China telecom Ã¢Â�Â� China Next Generation Internet"

Alas, it is valid UTF-8, and the spec doesn't say it has to be printable. But I found it strange, and hence I'd be thankful for your thoughts on this.

(And if possible, a definitive say on whether non-printable codepoint sequences like the one above are to be permitted.)

The mentioned database metadata:

  #{<<"binary_format_major_version">> => 2,
    <<"binary_format_minor_version">> => 0,
    <<"build_epoch">> => 1547570011,
    <<"database_type">> => <<"GeoLite2-ASN">>,
    <<"description">> =>
        #{<<"en">> => <<"GeoLite2 ASN database">>},
    <<"ip_version">> => 6,
    <<"languages">> => [<<"en">>],
    <<"node_count">> => 720869,<<"record_size">> => 24},

Add localization for Country names

Hello!

When I request country data from GeoLite2-Country.mmdb (from PHP, for instance), I receive the localized names of a country:

87.116.132.149
GeoIp2\Model\Country Object
(
    [raw:protected] => Array
        (
            [continent] => Array
                (
                    [code] => EU
                    [geoname_id] => 6255148
                    [names] => Array
                        (
                            [de] => Europa
                            [en] => Europe
                            [es] => Europa
                            [fr] => Europe
                            [ja] => ヨーロッパ
                            [pt-BR] => Europa
                            [ru] => Европа
                            [zh-CN] => 欧洲
                        )

                )

            [country] => Array
                (
                    [geoname_id] => 6290252
                    [iso_code] => RS
                    [names] => Array
                        (
                            [de] => Serbien
                            [en] => Serbia
                            [es] => Serbia
                            [fr] => Serbie
                            [ja] => セルビア
                            [pt-BR] => Sérvia
                            [ru] => Сербия
                            [zh-CN] => 塞尔维亚
                        )

                )

The country/continent names are localized into just a few languages. In particular, there is no translation into Serbian.

I found information that you are using GeoNames as source data, and it does have translations into all languages - https://www.geonames.org/RS/other-names-for-serbia.html

Is it possible to add more translations for countries? At least into the languages native to each country.

Thanks!

Overlapping networks

What is the expected behavior when there are (partially) overlapping subnets? Is it supported? What is returned in that case? All the matches? The smallest subnet?

Example, suppose we have 3 subnets:
10.0.0.0/8
10.1.0.0/16
10.1.1.0/24

When we query the IP 10.1.1.1, what is supposed to be returned? This IP belongs to all three networks.

Testing with the current Perl reader implementation, it is not clear what happens; the result seems to vary.
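To make the question concrete, here is what a "most specific subnet wins" rule would return for the example. Longest-prefix matching is a common convention in routing tables and is used here purely as an assumption; whether MMDB writers and readers behave this way is exactly what this issue asks. The function name is illustrative.

```python
import ipaddress


def most_specific_match(address: str, networks: list):
    """Return the matching network with the longest prefix, or None if nothing matches."""
    ip = ipaddress.ip_address(address)
    candidates = [ipaddress.ip_network(n) for n in networks]
    matches = [net for net in candidates if ip in net]
    return max(matches, key=lambda net: net.prefixlen, default=None)


subnets = ["10.0.0.0/8", "10.1.0.0/16", "10.1.1.0/24"]
print(most_specific_match("10.1.1.1", subnets))  # 10.1.1.0/24
```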

record size vs record value

It appears that the terms record size and record value (and the variables record_size and record_value) could be referring to the same thing.
If so, could the naming be changed to make them consistent?

Possibility to generate testing database instead of using downloaded binary files

Hi, is there a way/tool/API to generate test databases in MMDB format from a given input? I'm asking in the context of adding MaxMind support to the Envoy geolocation filter, where we currently have to check in the binary test databases from this repo, which is not the best approach from a security and code readability standpoint.

How to obtain testfiles for CI without running into licensing issues?

Is it okay to copy the files test-data/GeoLite2-City-Test.mmdb and test-data/GeoLite2-Country-Test.mmdb into other (MIT licensed) projects? The CC-BY-SA 3.0 license on this repository seems to forbid it (as I understand the license, it would be viral).

If not, are these or similar test data files available somewhere, so that we can keep our CI builders from hitting the download limits?
