community-id-spec's Introduction

Community ID Flow Hashing

When processing flow data from a variety of monitoring applications (such as Zeek and Suricata), it's often desirable to pivot quickly from one dataset to another. While the required flow tuple information is usually present in the datasets, the details of such "joins" can be tedious, particularly in corner cases. This spec describes "Community ID" flow hashing, standardizing the production of a string identifier representing a given network flow, to reduce the pivot to a simple string comparison.

Pseudo code

function community_id_v1(ipaddr saddr, ipaddr daddr, port sport, port dport, int proto, int seed=0)
{
    # Get seed and all tuple parts into network byte order
    seed = pack_to_nbo(seed); # 2 bytes
    saddr = pack_to_nbo(saddr); # 4 or 16 bytes
    daddr = pack_to_nbo(daddr); # 4 or 16 bytes
    sport = pack_to_nbo(sport); # 2 bytes
    dport = pack_to_nbo(dport); # 2 bytes

    # Abstract away directionality: flip the endpoints as needed
    # so the smaller IP:port tuple comes first.
    saddr, daddr, sport, dport = order_endpoints(saddr, daddr, sport, dport);

    # Produce 20-byte SHA1 digest. "." means concatenation. The
    # proto value is one byte in length and followed by a 0 byte
    # for padding.
    sha1_digest = sha1(seed . saddr . daddr . proto . 0 . sport . dport)

    # Prepend version string to base64 rendering of the digest.
    # v1 is currently the only one available.
    return "1:" + base64(sha1_digest)
}

function community_id_icmp(ipaddr saddr, ipaddr daddr, int type, int code, int seed=0)
{
    port sport, dport;

    # ICMP / ICMPv6 endpoint mapping directly inspired by Zeek
    sport, dport = map_icmp_to_ports(type, code);

    # ICMP is IP protocol 1, ICMPv6 would be 58
    return community_id_v1(saddr, daddr, sport, dport, 1, seed); 
}
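
The pseudo code above can be sketched in Python as follows. This is a minimal illustration using only the standard library, not the reference implementation. The function name mirrors the pseudo code, and the sketch assumes a protocol with ports (TCP, UDP, SCTP) and two endpoints of the same address family.

```python
import base64
import hashlib
import ipaddress
import struct


def community_id_v1(saddr, daddr, sport, dport, proto, seed=0):
    """Sketch of Community ID v1 for a port-based protocol (e.g. TCP=6, UDP=17)."""
    # Pack addresses into network byte order: 4 bytes for IPv4, 16 for IPv6.
    src = ipaddress.ip_address(saddr).packed
    dst = ipaddress.ip_address(daddr).packed

    # Abstract away directionality: the smaller (address, port) pair goes first.
    # Comparing equal-length packed bytes compares the addresses numerically.
    if (src, sport) > (dst, dport):
        src, dst = dst, src
        sport, dport = dport, sport

    # seed (2 bytes) . saddr . daddr . proto (1 byte) . pad (1 byte) . sport . dport
    data = (
        struct.pack("!H", seed)
        + src
        + dst
        + struct.pack("!BB", proto, 0)
        + struct.pack("!HH", sport, dport)
    )
    digest = hashlib.sha1(data).digest()

    # Prepend the version to the base64 rendering of the 20-byte digest.
    return "1:" + base64.b64encode(digest).decode("ascii")


# Both directions of the same TCP flow yield the same ID.
forward = community_id_v1("10.0.0.1", "127.0.0.1", 1234, 80, 6)
reverse = community_id_v1("127.0.0.1", "10.0.0.1", 80, 1234, 6)
print(forward, forward == reverse)
```

Swapping the address and the port together keeps each port attached to its own endpoint, which is what makes the result independent of the direction in which a monitor first saw the flow.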

Technical details

  • The Community ID is an additional flow identifier and doesn't need to replace existing flow identification mechanisms already supported by the monitors. It's okay, however, for a monitor to be configured to log only the Community ID, if desirable.

  • The Community ID can be computed as a monitor produces flows, or can also be added to existing flow records at a later stage assuming that said records convey all the needed flow endpoint information.

  • Collisions in the Community ID, while undesirable, are not considered fatal, since the user should still possess flow timing information and possibly the monitor's native ID mechanism (hopefully stronger than the Community ID) for disambiguation.

  • The hashing mechanism uses seeding to enable additional control over "domains" of Community ID usage. The seed defaults to 0, so the mechanism stays out of the way for operators who aren't interested in it.

  • In version 1 of the ID, the hash algorithm is SHA1. Future hash versions may switch it or allow additional configuration.

  • The binary 20-byte SHA1 result gets base64-encoded to reduce output volume compared to the usual ASCII-based SHA1 representation. This assumes that space, not computation time, is the primary concern, and may become configurable in a later version.

  • The resulting flow ID includes a version number to make the underlying Community ID implementation explicit. This allows users to ensure they're comparing apples to apples while supporting future changes to the algorithm. For example, when one monitor's version of the ID incorporates VLAN IDs but another's does not, hash value comparisons should reliably fail. A more complex form of this feature could allow capturing configuration settings in addition to the implementation version.

    The versioning scheme currently simply prefixes the hash value with "<version>:", yielding something like this in the current version 1:

    1:hO+sN4H+MG5MY/8hIrXPqc4ZQz0=

  • The hash input is aligned on 32-bit-boundaries. Flow tuple components use network byte order (big-endian) to standardize ordering regardless of host hardware.

  • The hash input is ordered to remove directionality in the flow tuple: swap the endpoints, if needed, so the numerically smaller IP:port tuple comes first. If the IP addresses are equal, the ports decide. For example, the following netflow 5-tuples create identical Community ID hashes because they both get ordered into the sequence 10.0.0.1, 127.0.0.1, 1234, 80.

    • Proto: TCP; SRC IP: 10.0.0.1; DST IP: 127.0.0.1; SRC Port: 1234; DST Port: 80
    • Proto: TCP; SRC IP: 127.0.0.1; DST IP: 10.0.0.1; SRC Port: 80; DST Port: 1234
  • This version includes the following protocols and fields:

    The above does not currently cover how to handle nesting (IP in IP, v6 over v4, etc) as well as encapsulations such as VLAN and MPLS.

  • If a network monitor doesn't support any of the above protocol constellations, it can safely report an empty string (or another non-colliding value) for the flow ID.

  • Consider v1 a prototype. Feedback from the community, particularly implementers and operational users of the ID, is greatly appreciated. Please create issues directly in the GitHub project at https://github.com/corelight/community-id-spec, or contact Christian Kreibich.

  • Many thanks for helpful discussion and feedback to Victor Julien, Johanna Amann, and Robin Sommer, and to all implementors and supporters.
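
The 32-bit alignment noted in the list above can be checked with a little arithmetic. The sketch below assumes the field widths from the pseudo code (2-byte seed, 1-byte protocol, 1-byte padding, 2-byte ports) and uses `struct` format strings purely as a calculator.

```python
import struct

# seed, saddr, daddr, proto, padding, sport, dport in network byte order
IPV4_LAYOUT = "!H4s4sBBHH"
IPV6_LAYOUT = "!H16s16sBBHH"

ipv4_len = struct.calcsize(IPV4_LAYOUT)
ipv6_len = struct.calcsize(IPV6_LAYOUT)

# With the padding byte after proto, both totals are multiples of 4 bytes.
print(ipv4_len, ipv4_len % 4)  # 16 0
print(ipv6_len, ipv6_len % 4)  # 40 0
```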

Reference implementation

A complete implementation is available in the pycommunityid package. It includes a range of tests to verify correct computation for the various protocols. We recommend it to guide new implementations.

A smaller implementation is also available via the community-id.py script in this repository, including the byte layout of the hashed values (see packet_get_comm_id()). See --help and make.sh to get started:

  $ ./community-id.py --help
  usage: community-id.py [-h] [--seed NUM] PCAP [PCAP ...]

  Community flow ID reference

  positional arguments:
    PCAP         PCAP packet capture files

  optional arguments:
    -h, --help   show this help message and exit
    --seed NUM   Seed value for hash operations
    --no-base64  Don't base64-encode the SHA1 binary value
    --verbose    Show verbose output on stderr

For troubleshooting, the implementation supports omitting the base64 operation, and can provide additional detail about the exact sequence of bytes going into the SHA1 hash computation.
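
To illustrate what omitting the base64 step changes, the sketch below renders the same 20-byte digest both ways. It is an illustration only: the input bytes are an arbitrary placeholder rather than a real flow tuple, and the `1:` prefix on the hex form is assumed to match the base64 form.

```python
import base64
import hashlib

digest = hashlib.sha1(b"placeholder input, not a real flow tuple").digest()

b64_id = "1:" + base64.b64encode(digest).decode("ascii")
hex_id = "1:" + digest.hex()

# 20 bytes -> 28 base64 characters versus 40 hex characters (plus the prefix).
print(len(b64_id), len(hex_id))  # 30 42

# Converting between the two renderings is lossless.
assert base64.b64decode(b64_id[2:]) == bytes.fromhex(hex_id[2:])
```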

Reference data

The baseline directory in this repo contains datasets to help you verify that your implementation of Community ID functions correctly.

Reusable modules/libraries

Production implementations

Feature requests in other projects

Talks

Blog posts and other resources

Discussion

Feel free to discuss aspects of the Community ID via GitHub here: https://github.com/corelight/community-id-spec/issues

community-id-spec's People

Contributors

adulau, awelzel, ckreibich, danhermann, lucaderi, mavam, zwass

community-id-spec's Issues

The base64 representation of the ID is clunky to work with

The standard rendering of the ID (e.g., 1:ZEYOYMeyZNQC9DAdgsBZCtiTKqw=) is not only unpleasant to look at, but can also break standard string handling in SIEM pipelines and other tools (so you need to add quotes, for example). A reduced character set would work better here, at the expense of a slightly longer string.

I'm considering moving to something like base58 as a default for version 2 of the ID. Thoughts welcome.
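
For a feel of what such a rendering might look like, here is a small base58 sketch. Everything in it is an assumption for illustration: the Bitcoin-style alphabet, the hypothetical `2:` prefix, and the placeholder hash input. No version 2 format is defined in this document.

```python
import hashlib

# Bitcoin-style base58 alphabet: no 0, O, I, l, and no punctuation.
ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def base58_encode(data):
    """Encode bytes as base58, keeping leading zero bytes as '1' characters."""
    number = int.from_bytes(data, "big")
    encoded = ""
    while number > 0:
        number, remainder = divmod(number, 58)
        encoded = ALPHABET[remainder] + encoded
    leading_zeros = len(data) - len(data.lstrip(b"\x00"))
    return "1" * leading_zeros + encoded


digest = hashlib.sha1(b"placeholder input, not a real flow tuple").digest()
candidate_id = "2:" + base58_encode(digest)  # the "2:" prefix is hypothetical
print(candidate_id)
```

Apart from the version separator, the result contains only letters and digits, so it avoids the `+`, `/`, and `=` characters that tend to need quoting.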

Adding test vectors

Add test vectors to the default directory of test pcaps so that implementers can verify their implementation is correct.

Sharefulness - 4 tuple

If there was a 4 tuple hash, then I could share these hashes with other people and tools, between different networks, and use them in very much the same way. Dropping the source address would mean that hash x can be applied against traffic in any network inside and outside of a particular organization. It would put the community in Community ID.

Include additional flow properties in ID

The Community ID could include features beyond the flow tuple, such as the presence of particular file transfers in the flow. This could aid in disambiguation of flows with otherwise colliding IDs, but narrows applicability to monitors that are able to track at this level of inspection.

This is a feature suggestion from SuriCon 2018.

Use case: anonymity

A few folks have suggested that one could share the ID as an anonymous/pseudonymous substitute for the flow tuple, to avoid revealing the actual flow. (In analogy to sharing a file hash instead of the actual file, for example.) Applicability here seems much more narrow since the ID would most likely be of value only to the parties able to observe the underlying flow.

It may be of interest to keep the flow endpoints discernible in the ID (as a pair of hashes, perhaps) — doing so would allow checking whether one has also seen a certain endpoint in abusive behavior, etc. But that immediately leads to separating the address from the port, so we're essentially down to rendering each part of the flow tuple separately. Seems in those settings you might as well not use the ID in the first place.

I'm afraid I don't remember all individuals who have brought this up. — @vivekrj asked on Twitter, as did one participant at the 2018 Bro workshop in Karlsruhe, Germany.

Additional thoughts are very welcome.

Community ID spec appears to rely on underlying application implementation for src/dst direction determination

Hey,

While doing some research on various data we have, we've seen that two separate applications that observed the same flow each had their own interpretation of src/dst ordering.

You'll have to excuse the formatting but in application 1 (the data is fake but the example has happened), the flow was seen as:

src,srcp,dst,dstp
aaa,65000,bbb,443

While in application 2 it was reversed:

src,srcp,dst,dstp
bbb,443,aaa,65000

Looking at the spec itself, we appear to leave the ordering of the src/dst combination up to the individual application and if we had implemented community ID in both applications, we would have ended up with a different community ID for the same flow.

I would suggest that the spec is updated to specify how the src/dst combination should be calculated/sorted in order to ensure a consistent community ID is generated for the same flow (regardless of what the underlying application thinks the direction is).

Thanks,
Conor

Hash function performance

We've built community ID support into VAST to allow for pivoting between ingested PCAPs, Suricata, Zeek, and NetFlow/IPFIX. Our C++ implementation of community ID computation is available here.

We have observed approximately 8% loss in IPFIX ingestion performance when enabling community ID computation.

Recently we have experimented with replacing SHA1 in the community ID computation, and have had great results using xxHash. The overall performance loss went down from 8% to 3%.

I am proposing to use XXH3 (supposed to be stabilized in H1 2020) for community ID v2 to improve the usability on high-throughput paths.

Could we use fuzzy hashing to robustly incorporate time?

The time dimension is currently completely missing from the ID computation, because obvious ways of incorporating it (e.g. by rounding to the current day) impose risks of ID deviation around the rounding boundary when multiple monitors aren't perfectly in sync. Suggestions for techniques not prone to this shortcoming are very welcome.

One such suggestion is the use of ssdeep-style fuzzy hashing. I don't think it immediately applies because time representations are subject to major jumps themselves — consider a counter that jumps from 199,999 to 200,000. But here too, discussion is very welcome.

This was a talk discussion question at SuriCon 2018.

A baseline_nob64 test fails?

I reviewed this file and ran a test:

https://github.com/corelight/community-id-spec/blob/master/baseline/baseline_nob64.json

The last two tests:
{"proto": 46, "saddr": "10.1.12.1", "daddr": "10.1.12.2", "sport": null, "dport": null, "communityid": "1:28794b920a095bb89f513c934a0c957e4147ccac"},
{"proto": 46, "saddr": "10.1.12.2", "daddr": "10.1.12.1", "sport": null, "dport": null, "communityid": "1:28794b920a095bb89f513c934a0c957e4147ccac"}

The community ID for this appears to be incorrect. I think this community ID came from having the IPs flipped based on my test. When I flip it so 10.1.12.2 comes first and 10.1.12.1 comes second in the byte array it gives me: 28794b920a095bb89f513c934a0c957e4147ccac

I think it should be:
1f2f15b2c1f825376e9a372350b6e06202e7f1c7
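
One way to reproduce a check like this is to hash both address orderings and compare each against the baseline value. The sketch below is a hypothetical helper, not part of the repository. It assumes that for protocols without ports the hash input is just the seed, the two addresses, the protocol byte, and the padding byte, which the pseudo code above does not spell out.

```python
import hashlib
import ipaddress
import struct


def hex_id_no_ports(first, second, proto, seed=0):
    """Hash the addresses in exactly the order given, without any reordering."""
    data = (
        struct.pack("!H", seed)
        + ipaddress.ip_address(first).packed
        + ipaddress.ip_address(second).packed
        + struct.pack("!BB", proto, 0)
    )
    return "1:" + hashlib.sha1(data).hexdigest()


# Smaller address first, as the ordering rule asks for, and the flipped variant.
ordered = hex_id_no_ports("10.1.12.1", "10.1.12.2", 46)
flipped = hex_id_no_ports("10.1.12.2", "10.1.12.1", 46)
print("ordered:", ordered)
print("flipped:", flipped)
```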

community-id.py raises TypeError: ord() expected string of length 1, but int found

Is this script Python2 only? If so, can it be updated to support Python3?

$ python3 community-id.py --verbose test.1684238207.pcap
Traceback (most recent call last):
  File "/opt/pcap/community-id.py", line 381, in <module>
    sys.exit(main())
  File "/opt/pcap/community-id.py", line 376, in main
    hasher.pcap_handle(pcapfile)
  File "/opt/pcap/community-id.py", line 158, in pcap_handle
    self._packet_handle(tstamp, pktdata)
  File "/opt/pcap/community-id.py", line 358, in _packet_handle
    print_result(tstamp, pkt, self._packet_get_comm_id(pkt, key))
  File "/opt/pcap/community-id.py", line 302, in _packet_get_comm_id
    dlen = hash_update(struct.pack('!H', self.comm_id_seed)) # 2-byte seed
  File "/opt/pcap/community-id.py", line 297, in hash_update
    hexbytes = ':'.join('%02x' % ord(b) for b in data)
  File "/opt/pcap/community-id.py", line 297, in <genexpr>
    hexbytes = ':'.join('%02x' % ord(b) for b in data)
TypeError: ord() expected string of length 1, but int found

Here test.1684238207.pcap is the output of a 1-second tcpdump and a simple HTTP request across the wire.
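
The traceback points at `ord(b)` being applied while iterating over a `bytes` object. In Python 3 that iteration already yields integers, whereas Python 2 yields one-character strings. A minimal sketch of a formatting helper that works on both, assuming the surrounding `hash_update` logic stays unchanged (the name `to_hexbytes` is made up for this example):

```python
import struct


def to_hexbytes(data):
    # Iterating a bytearray yields ints on both Python 2 and Python 3,
    # so no ord() call is needed.
    return ':'.join('%02x' % b for b in bytearray(data))


print(to_hexbytes(struct.pack('!H', 0)))     # 00:00
print(to_hexbytes(struct.pack('!H', 1234)))  # 04:d2
```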

How Should IPv4 and IPv6 Be Compared?

IP addresses must be compared to form community IDs.

Is 192.0.2.77 greater than ::1 or less than it? One way to compare them is to upcast IPv4 address to an IPv6 address by using the IPv4-mapped range (i.e. ::FFFF:192.0.2.77). Is this what implementations are expected to do?

Could we push this to / get support from Elastic Search?

Yes. The Community ID can be computed on arbitrary data, not necessarily based on raw packets. Some concerns apply: for example, the spec requires network byte order for some hash inputs. Assuming such functionality is supported, any data store could compute the ID.

Also see #5.

This was a talk discussion question at SuriCon 2018.

Suffixing timestamps as a way to include timing information

DJ Gregor suggested one could simply append the flow's timestamp to the end of the ID when wanting to filter out clashing, unrelated flows. Doing so would allow you to prefix-match on the resulting overall string, which somewhat isolates you from the clock-rollaround problems naive inclusion of timestamps would have.
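
A rough sketch of that idea follows. The `@` separator and the epoch-seconds format are assumptions made for this example; the suggestion itself doesn't prescribe either.

```python
def with_timestamp(community_id, start_time):
    """Append a flow start time (epoch seconds) to a Community ID."""
    return "%s@%d" % (community_id, start_time)


def same_flow_tuple(tagged_id, community_id):
    """Prefix-match: ignore the timestamp and compare only the ID part."""
    return tagged_id.startswith(community_id + "@")


tagged = with_timestamp("1:hO+sN4H+MG5MY/8hIrXPqc4ZQz0=", 1500000101)
print(tagged)
print(same_flow_tuple(tagged, "1:hO+sN4H+MG5MY/8hIrXPqc4ZQz0="))
```

Since `@` never appears in base64 output, the ID part can always be split off again unambiguously.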

Incorporating local-vs-external network vantage points

A given network flow's 5-tuple will differ depending on whether it's perceived internally, externally, before or after a NAT, etc. Can the Community ID accommodate this?

The short answer is no, since there's an assumption that the ID is based on the observed flow tuple, so presence of a NAT will make a semantically identical flow look like two different ones. However, this relates to #4 in that the ID could be based on other flow properties such as QUIC's connection ID.

This was a talk discussion question at SuriCon 2018.

additional request-reply pair ICMP/ICMPv6: Extended Echo

At https://tools.ietf.org/html/rfc8335 an additional request-reply pair is documented.
ICMP field Type: Extended Echo Request. The value for ICMPv4 is 42. The value for ICMPv6 is 160
ICMP field Type: Extended Echo Reply. The value for ICMPv4 is 43. The value for ICMPv6 is 161

Since that wasn’t in the version 1 table of the reference Corelight python, hashes using that code would set one or the other "port-like value" to the code octet instead of to the type octet of the corresponding request-response peer. This issue was also reported as: corelight/pycommunityid#2 (comment)

This specific case also highlights that: which pairs are in the table of the corresponding request-response ICMP and ICMPv6 pairs, is a specification point, and should not be subject to variation, or derived from a reference implementation outside of the specification.
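
To show what such a pairing table does, here is a sketch of `map_icmp_to_ports` for ICMPv4. The pairs listed are assumptions based on the standard ICMPv4 type numbers plus the RFC 8335 values quoted above. They are not copied from the spec or the reference implementation, and the authoritative table may differ.

```python
# Request/reply pairs for ICMPv4, keyed by type. Assumed for illustration.
ICMP4_PAIRS = {
    8: 0, 0: 8,      # echo request / echo reply
    13: 14, 14: 13,  # timestamp / timestamp reply
    15: 16, 16: 15,  # information request / reply
    17: 18, 18: 17,  # address mask request / reply
    10: 9, 9: 10,    # router solicitation / advertisement
    42: 43, 43: 42,  # extended echo request / reply (RFC 8335)
}


def map_icmp_to_ports(icmp_type, icmp_code):
    """Return (sport, dport) stand-ins for an ICMPv4 message."""
    if icmp_type in ICMP4_PAIRS:
        # Paired types: use the peer's type so both directions can match.
        return icmp_type, ICMP4_PAIRS[icmp_type]
    # Unpaired types: fall back to the code octet.
    return icmp_type, icmp_code


print(map_icmp_to_ports(8, 0))   # (8, 0)
print(map_icmp_to_ports(0, 0))   # (0, 8)
print(map_icmp_to_ports(3, 1))   # (3, 1)
print(map_icmp_to_ports(42, 0))  # (42, 43)
```

Without the 42/43 entries, an Extended Echo Request would map to (42, code) and its reply to (43, code), so the two directions would no longer reduce to the same tuple.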

Can I use this code?

This implementation looks great and I'd love to use it in my app. Do you give people permission to use the code? If so, can you add an open source license to the project? I see you have the BSD license on some of the other repos. Adding it to this one would give me permission to use this code.
Thanks!

Bad port comparison when srcIp==dstIp

Implemented the algorithm and tested the resulting ids comparing against this repo's python script and also against wireshark's community id analysis tool. My implementation's results seemed to always match wireshark's but not corelight's.

When having same ip addresses for src and dst, the following port values seem to be handled differently by corelight's script:

     ip_src   ip_dest  proto(tcp)  port_src  port_dst  community id from corelight python script  community id from wireshark
     1.2.3.4  1.2.3.4      6         3344      1122     1:CY1T7/6B7r9W3LMnzSws9RXqqbQ=             1:3seqIXu+5y8sFuE3lLtWR/KnSWo=
     1.2.3.4  1.2.3.4      6         1122      3344     1:CY1T7/6B7r9W3LMnzSws9RXqqbQ=             1:3seqIXu+5y8sFuE3lLtWR/KnSWo=
     1.2.3.1  1.2.3.1      6         3344      1122     1:jD1eCyop8ZzeL/0xgO58JtpHPLE=             1:GfxcHO3Gsn1cQxpPJoBhxUMcrbU=
     1.2.3.4  1.2.3.4      6         5566      7788     1:YZWovxgLMFDntXoQs0LgEGs9QcQ=             1:BjNbSLaYeZZX0M1egqjh1Akg9yw=
     1.2.3.4  1.2.3.4      6         6655      8877     1:7hdcx4YnNvllNYNSbtSzegQFnjg=             1:miYC8NFsg/sTi5HwWjjyifbbp+8=

Port combinations like: {1122, 44424}, {334, 44424}, {1133, 2244}, {8899, 1199} didn't seem to cause any issues though.

In function is_ordered(), changing port comparison in the return to:

int.from_bytes(nbo.sport, 'little') < int.from_bytes(nbo.dport, 'little')

seems to get the cases I tried to match.
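
A possible explanation for the pattern is a byte-order mix-up when comparing ports. The sketch below doesn't establish which tool is at fault. It only compares the ports from the report numerically and again after swapping their two bytes, and shows that the two orderings disagree for the pairs reported as mismatching and agree for the others.

```python
import struct


def byteswapped(port):
    """Interpret the port's network-order bytes as a little-endian integer."""
    return int.from_bytes(struct.pack("!H", port), "little")


def orderings_agree(port_a, port_b):
    numeric = port_a < port_b
    swapped = byteswapped(port_a) < byteswapped(port_b)
    return numeric == swapped


for pair in [(3344, 1122), (5566, 7788), (6655, 8877)]:
    print(pair, orderings_agree(*pair))   # False for each pair

for pair in [(1122, 44424), (334, 44424), (1133, 2244), (8899, 1199)]:
    print(pair, orderings_agree(*pair))   # True for each pair
```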

Is this okay to use in other applications?

Absolutely. Suricata and Zeek are initial applications to support the ID, but wider adoption (e.g. in Snort, Wireshark, or Scapy) is explicitly encouraged. Just be aware that the implementation is likely to undergo substantial updates over time.

This was a talk discussion question at SuriCon 2018.

Include one or two examples in the spec

At my company, we are enriching traffic logs with Community IDs using an implementation of this spec that we rolled ourselves. It would be nice if the spec included some examples of an input 5-tuple along with the expected Community ID. Then it would be easy to tell if we implemented the spec correctly.

IPv6

I am writing a program in TypeScript to use the Community ID. I have IPv4 working great, but not IPv6.
I know the Python packing runs and works, but is it correct?

I tried to use the verbose mode in the community-id.py to look at the buffer to validate this but it is broken. python community-id.py --no-base64 --verbose pcaps\ipv6.pcap

Your output produces the following:

1500000101.216315 | 1:fea15a78047e8057b52988cccd50ec32ffb0814e | 2607:f8b0:400c:c03::1a 2001:470:e5bf:dead:4957:2174:e82c:4887 6 25 63943
1500000101.416173 | 1:fea15a78047e8057b52988cccd50ec32ffb0814e | 2001:470:e5bf:dead:4957:2174:e82c:4887 2607:f8b0:400c:c03::1a 6 63943 25

My output with my buffer:
expanded and removed ":" from ip.
r1: 2607f8b0400c0c03000000000000001a
r2: 20010470e5bfdead49572174e82c4887
ArrayBuffer {
[Uint8Contents]: <00 00 26 07 f8 b0 40 0c 0c 03 00 00 00 00 00 00 00 1a 20 01 04 70 e5 bf de ad 49 57 21 74 e8 2c 48 87 06 00 00 19 f9 c7>,
byteLength: 40
}
f7aae5da871859418931a2ddfbf7220ecbc50412
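
Comparing the buffer above with the spec's ordering rule suggests one thing to check: 2001:470:... is numerically smaller than 2607:f8b0:..., so it (and its port, 63943) would come first in the hash input. A sketch of building the ordered buffer follows; it is a guess at the discrepancy rather than a confirmed diagnosis.

```python
import ipaddress
import struct

a = (ipaddress.ip_address("2607:f8b0:400c:c03::1a").packed, 25)
b = (ipaddress.ip_address("2001:470:e5bf:dead:4957:2174:e82c:4887").packed, 63943)

# Smaller (address, port) endpoint first; packed bytes compare numerically.
(saddr, sport), (daddr, dport) = sorted([a, b])

buf = (
    struct.pack("!H", 0)            # seed
    + saddr + daddr
    + struct.pack("!BB", 6, 0)      # proto (TCP) and padding byte
    + struct.pack("!HH", sport, dport)
)
print(len(buf), buf.hex())
```

Hashing this buffer with SHA1 and comparing the hex digest against the script's `--no-base64` output would show whether the endpoint order was the only difference.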
