cardinal-rs's People

Contributors: dependabot[bot], jpmckinney, yolile

cardinal-rs's Issues

Allow bid status to be absent

Allow a user to indicate that the dataset uses a single bid status (i.e. 'valid'). We can then calculate indicators as if the status were set. (I don't yet know how common this is.)

Add support for lots

Most publishers do not implement lots, so for now the indicator calculations are simpler / naive.

  • R024 The winning offer is just below the next lower offer: If lots are present, we can relax the requirement for there to be a single award, since we'll be able to determine which bids are competing for which lot (i.e. whom each awardee was competing with).

TBD whether to include lot IDs in the output (probably), re: #25.

Use parties/identifier instead of organization reference ID

Some publishers have good organization reference IDs, e.g. DR sets it to "{parties/identifier/scheme}-{parties/identifier/id}".

Other publishers don't, and we'll need to construct IDs from parties/identifier as above.
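A minimal sketch of constructing such an ID (the function shape and names are illustrative, not Cardinal's internals):

// Construct an organization ID in the "{scheme}-{id}" form described above.
fn organization_id(scheme: Option<&str>, id: Option<&str>) -> Option<String> {
    match (scheme, id) {
        (Some(scheme), Some(id)) => Some(format!("{scheme}-{id}")),
        // Without a usable identifier, the caller decides whether to fall
        // back to (i.e. trust) the organization reference ID.
        _ => None,
    }
}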


Once this is implemented, we can add a configuration option to opt in to falling back to (i.e. trusting) the organization reference ID (faster, and useful if the user knows that IDs are well constructed in cases where parties/identifier is not populated).

Edit: This configuration is useful, because cross-referencing parties is annoying.

prepare: Add option for user to quiet an issue without fixing it

For example, DR's current dataset has at least one bid without a status in 1% of its procedures, but this occurs both when there are awards and when there are none, and when there are other bids with statuses. So, the intended status is not really knowable.

Since this issue produces about 3000 lines of text, it would be nice to be able to quiet it, e.g.

[quiet]
missing_bid_status

Potential indicator methodology changes (R003, R024, R025)

R003 Short submission period

OCDS 1.2 adds tender/expressionOfInterestDeadline. How should this field be used, if present?

“Short submission period” in An Objective Corruption Risk Index Using Public Procurement Data suggests taking weekends (and holidays) into account:

Abuse of weekends is possible as legally required time periods are defined in calendar days so the effective time companies would have for bid preparation can further be decreased by including weekends and national holidays in the submission period.
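For illustration, a minimal sketch of counting only weekdays in a submission period, using the chrono crate (holiday handling and the exact OCDS fields used are assumptions, not the current R003 logic):

use chrono::{Datelike, NaiveDate, Weekday};

// Count the weekdays (Mon-Fri) after `start` up to and including `end`.
// A per-jurisdiction holiday calendar would be needed on top of this.
fn weekdays_between(start: NaiveDate, end: NaiveDate) -> u32 {
    let mut date = start;
    let mut count = 0;
    while date < end {
        date = date.succ_opt().expect("date out of range");
        if !matches!(date.weekday(), Weekday::Sat | Weekday::Sun) {
            count += 1;
        }
    }
    count
}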

R024 The winning offer is just below the next lower offer

Presently, we require there to be a single supplier and single tenderer.

In the case of consortia – for example – it's possible for the two fields to contain the same IDs. It's perhaps useful to extend the methodology to cover this case, e.g. one consortium colludes with others.

However, we've also observed cases where they are different, e.g. the buyer awards items to multiple bidders who submitted individually. (I'm not sure that this case is rescuable – there's probably no option except to skip such cases).

R025 The ratio of winning bids to submitted bids for a top tenderer is a low outlier

James: I don't know what to do for multiple winning bids, bids with multiple tenderers, or awards made to multiple suppliers.

Camila: For these cases, I would suggest counting each bid and award separately. For instance, if a bid has 2 tenderers and both win, each tenderer would have 1 bid and 1 award.
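A sketch of that counting rule, with simplified inputs (each bid as a list of tenderer IDs and each award as a list of supplier IDs; the types are illustrative):

use std::collections::HashMap;

#[derive(Default)]
struct Tally {
    bids: u32,
    awards: u32,
}

// A bid with 2 tenderers that wins contributes 1 bid and 1 award to each of them.
fn tally_per_tenderer(bids: &[Vec<String>], awards: &[Vec<String>]) -> HashMap<String, Tally> {
    let mut tallies: HashMap<String, Tally> = HashMap::new();
    for tenderers in bids {
        for id in tenderers {
            tallies.entry(id.clone()).or_default().bids += 1;
        }
    }
    for suppliers in awards {
        for id in suppliers {
            tallies.entry(id.clone()).or_default().awards += 1;
        }
    }
    tallies
}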

New command: statistics

It can be useful to report some order statistics and distributions that are relevant to indicators. For example:

  • Distribution of procurementMethod codes: so that the user can evaluate if the distribution of open, selective, limited, direct conforms to their knowledge of the procurement market
    • From user research: "A methodology should also come with clear risk warnings for instance the use of certain fields. Are there some fields that we know are problematic when it comes to bias in the data? (e.g. Could there be a bias toward using 'selective' instead of 'limited' in procurementMethod?)"

We might also consider reporting:

  • Some priority quality issues (e.g. incoherent dates).
  • Outliers. If there is demand, we can also change the indicators command to ignore outliers.
  • Order statistics (possibly per procurement method) to assist the user in setting threshold values

Ideas for new flags, while implementing other flags

While working on R035:

  • A bid has not been evaluated, but all awards are finalized
  • Bids are withdrawn if not submitted by the single tenderer of the winning bid (i.e. other bidders only submit to simulate competition)

For methodology changes to existing flags, see #17

Improve documentation

Once the API stabilizes, after more features are added:

  • Generate a ReadTheDocs website
  • Add fictional, narrative examples in blockquotes (to ease understanding of the indicator)
  • Add more command invocation examples (and run doc-tests)
    • init command
    • prepare command
    • indicators command

Reference: Rust libraries

Having gone through the top 250 at https://lib.rs/std, some libraries not already in use:

Testing

Probably not relevant:

Also:

CLI

Also:

Errors

flatterer uses https://docs.rs/snafu/latest/snafu/

Calculations

Other

Performance

Consider support for string "amount" values

We expect this to be a very rare occurrence.

The OCDS implementation in Belarus had a case where a number in the source system was badly formatted, and it was therefore published as a string.

However, in that case, we won't have much luck converting to a float: making assumptions about whether a comma or a period is used as the decimal separator, etc. might just lead to incorrect results.

Leaving this issue open for now, but I think it might be wontfix.

CLI option/command to add metadata relevant to BI reporting

Review the frontend mockups and requirements to check for other fields by which results are filtered/aggregated.

Always included:

  • flag ID
  • result metadata
  • primary ID (OCID if indicator can be calculated per contracting process, buyer/supplier ID if indicator must be calculated for that buyer/supplier across the entire dataset)

Opt-in (BI) metadata:

  • flag category
  • date(s) (which?)
  • secondary IDs (e.g. the buyer and suppliers involved in the flagged contracting process, or vice versa)
  • ...

Requested by DR:

  • process stage (whether awards is set), aka awarded or unawarded
  • tender/procurementMethodDetails (9-value codelist)
  • tender/startDate year

After #14, we probably want to add the lot ID.

Add release workflow


Docs automatically generated at docs.rs

Things we probably won't do:

No longer up-to-date or otherwise irrelevant:

Report the indicator's coverage (application, pass, fail counts and total)

For example, Pelican reports "pass", "fail" and "not applicable" for quality checks.

Cardinal presently only reports "fail" for red flags.

It might be useful to be able to review the N/A results.

This would involve, at minimum, storing:

  • the result (pass, fail, N/A), as an Option<bool>
    • Or, just fail and N/A – storing something for every "pass" is probably just bloat
  • the reason

Pelican also stores other metadata, like application_count and pass_count for checks that operate on arrays, and then easily accessible metadata to understand why the check failed (e.g. the paths to the fields that caused the failure).
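A sketch of what could be stored per result (the field names are illustrative):

// One stored row per (indicator, contracting process or organization).
struct IndicatorResult {
    // Some(true) = pass, Some(false) = fail, None = not applicable.
    // (Or store only fail and N/A, as suggested above.)
    result: Option<bool>,
    // Why the check failed or was not applicable, e.g. paths to the
    // offending fields.
    reason: Option<String>,
}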

Reference: Compiled release sizes

In my Rust testing, I'm setting an initial capacity of 1 MiB for the vector of characters of a compiled release.

The table below is for jobs in the data registry (some jobs are for the same collection). The maximum is 147 MiB (!), the second highest is 24 MiB.

Starting with a capacity of 1 MiB, there would be at most 8 reallocations for the one above 128 MiB, 5 reallocations for those few above 16 MiB, and fewer for the rest. This seems fine.

Note that if we didn't set an initial capacity, the capacity would be set when first pushing onto the vector. For job 696, the shortest line is 933 bytes, which would take 10-11 reallocations to get to 1 MiB (its longest line is a little over 2 MiB). For that job, we would have a total of 2 reallocations by starting with 1 MiB, instead of 12 in the worst case (i.e. if the shortest line were the first line). I haven't taken the time to compare to a median-length first line.

Longest line:

find . -name 'full.jsonl.gz' -exec sh -c 'echo {}; gunzip -c {} | awk "{ if (length > L){L=length} }END{print L}"' \;

Shortest line:

find . -name 'full.jsonl.gz' -exec sh -c 'echo {}; gunzip -c {} | awk "{ if (L == \"\" || length < L){L=length} }END{print L}"' \;
bytes job id
2051 711
2051 742
2051 766
2071 468
2071 556
2071 592
2071 625
2071 679
2628 532
2628 581
2628 617
2628 664
2628 699
2628 735
2628 757
2709 422
3072 340
3072 384
4457 347
4604 470
5260 323
5304 426
5326 415
5398 789
5475 353
5475 362
5475 445
5475 549
5475 589
5475 622
5475 673
5475 740
5475 764
5716 576
5741 521
5990 552
5990 591
5990 624
5990 675
5990 709
5990 741
5990 765
6010 467
7362 670
7362 738
7362 762
7367 324
7367 379
7367 483
7367 493
7367 563
7367 597
7367 658
8895 471
10054 487
11201 473
11201 554
11201 593
11201 626
11201 680
11201 712
11201 743
11201 767
12689 610
12737 474
13279 387
13279 728
14098 392
14098 537
14223 386
14231 727
14488 320
14488 378
14488 481
14488 494
14488 564
14488 598
14488 683
14488 717
14488 746
16585 660
16585 697
16585 791
19856 503
19856 568
19856 603
19856 686
19857 718
20474 370
20474 447
20474 550
20474 588
20474 623
20474 674
20474 739
20474 763
21073 402
21471 491
22713 385
22850 726
22923 770
22930 748
23466 357
23466 363
23466 520
23466 574
23466 608
23466 676
24460 583
24460 618
24460 665
24460 700
31868 497
31868 566
32185 373
32189 448
37163 600
37163 685
41248 551
41248 590
41273 388
41273 389
50147 395
54573 790
55566 691
56411 522
56411 575
56411 609
56411 692
56411 724
56411 753
59588 546
59588 586
59588 621
59588 668
59588 703
59588 734
59588 758
59590 428
59749 570
59759 514
59763 367
64710 716
64710 793
67060 329
67060 381
67390 562
67729 409
75089 350
75089 361
75089 444
89443 534
91096 405
93306 424
109972 672
117903 427
121715 337
121715 430
121715 547
124777 393
124777 536
124777 580
124777 616
133466 687
133726 504
133726 573
133726 607
133726 723
133726 749
133726 773
134096 719
134096 752
134256 602
137126 771
148127 750
155727 326
155727 380
155727 484
155727 490
155727 560
155727 596
155727 628
155727 682
155727 714
155727 745
155727 774
164035 472
177309 778
181159 528
184990 359
184990 515
185160 567
196407 327
205987 539
206986 441
214911 425
214911 543
214911 585
214911 620
214911 667
214911 702
214911 736
214911 756
242554 524
256279 341
256515 418
259557 343
265329 414
272611 780
278653 511
278653 606
278653 690
305333 578
308637 615
308637 662
308637 698
308637 733
308637 761
324811 330
344030 390
344030 730
346808 399
408222 737
408222 775
441544 407
449754 725
481791 342
484040 345
532518 512
539538 419
566587 529
601246 349
601246 787
622821 413
662541 526
662683 577
662683 611
706831 694
706831 754
760184 410
880634 542
927369 421
1475448 406
1478839 339
1478839 383
1478839 498
1478839 569
1478839 604
1478839 688
1478839 720
1478839 747
1478839 772
1526742 671
1605718 525
1642379 322
1642379 377
1642379 486
1697247 559
1697247 594
1697247 627
1697247 681
1697247 713
1697247 744
1697247 768
2152041 612
2152041 656
2152041 696
2152041 732
2152041 779
2550939 684
2553305 715
2553305 792
2553739 496
2553739 565
2553739 599
2790994 404
4357161 401
4617653 659
4737232 396
4775975 572
5528742 695
5528742 731
5528742 776
7547759 519
7547759 571
7547759 605
7547759 689
10082242 403
10930049 348
10930049 364
13145632 777
13742917 408
18428801 530
20622679 666
20622815 584
20622815 619
20623099 701
20623145 794
22990409 523
25435453 398
154296486 346

Allow values to be absent

Allow a user to indicate that the dataset uses a single value for a given field. We can then calculate indicators as if the field were set. For example:

  • .../items/classification/scheme = 'UNSPSC' (observed for DR)
  • Value/currency (I'm not aware that any publisher omits currency entirely)
  • awards/status = 'active' (not sure how frequently this field is unset)
  • bids/details/status = 'active' (not sure how frequently this field is unset)
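If implemented like the quiet option above, the configuration might look something like the following (the section and key names are hypothetical, not Cardinal's current settings; "USD" is only a placeholder value):

[defaults]
item_classification_scheme = "UNSPSC"
currency = "USD"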

prepare: Fill in fields based on parties/roles

For example, if buyer and procuringEntity are not set, but parties/roles contains 'buyer' or 'procuringEntity', we can fill in these fields before calculating indicators (see the sketch after this list). Similarly:

  • If there is only one active award, we can set awards/suppliers from the parties with the 'supplier' role
  • If there is only one active award and no lots, we can set tender/tenderers from the parties with the 'tenderer' role
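A minimal sketch of the buyer fill-in, treating a compiled release as a serde_json::Value (a simplification, not Cardinal's actual data structures):

use serde_json::{json, Value};

// If release.buyer is unset, copy the id of the first party whose roles
// include "buyer". Assumes the release is a JSON object.
fn fill_in_buyer(release: &mut Value) {
    if release.get("buyer").is_some() {
        return;
    }
    let buyer_id = release
        .get("parties")
        .and_then(Value::as_array)
        .and_then(|parties| {
            parties.iter().find(|party| {
                party
                    .get("roles")
                    .and_then(Value::as_array)
                    .map_or(false, |roles| roles.iter().any(|role| role.as_str() == Some("buyer")))
            })
        })
        .and_then(|party| party.get("id").cloned());
    if let Some(id) = buyer_id {
        release["buyer"] = json!({ "id": id });
    }
}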

Performance improvements

serde should be fine, as our own code is the bottleneck. That said, simdjson is the fastest JSON parser. https://github.com/SunDoge/simdjson-rust tracks version 2.2.2 of simdjson, but is not available on crates.io. https://github.com/simd-lite/simd-json tracks 0.2.x (issue).

I also tried the following on the coverage code:

We can try them again once the code is more complex.

crossbeam is a better channel implementation; std::sync::mpsc's implementation was replaced with one based on crossbeam-channel in Rust 1.67 (January 2023).


In case it's relevant in the future, here is a fast way to read a file line-by-line:

use std::fs::File;
use std::io::{BufRead, BufReader};

use serde_json::Value;

// Compiled releases of multiple MiBs have been observed, but most are less than 1 MiB.
const CAPACITY: usize = 1024 * 1024;

let file = File::open(config.path)?;
// The buffer is allocated once and reused for every line.
let mut line = Vec::with_capacity(CAPACITY);
let mut reader = BufReader::new(file);

while reader.read_until(b'\n', &mut line).unwrap_or(0) > 0 {
    let value: Value = serde_json::from_slice(&line)?;
    // ...
    line.clear();
}

It is faster because it only allocates memory for one line. It can't be used if each line is passed to a thread for parsing; in that case, memory needs to be allocated for each line (i.e. using for line in reader.lines()).
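For the threaded case, a minimal sketch using reader.lines() and the crossbeam-channel crate mentioned above (the worker count, channel capacity and function shape are placeholders, not the actual implementation):

use std::fs::File;
use std::io::{BufRead, BufReader};
use std::thread;

use serde_json::Value;

fn parse_in_threads(path: &str) -> std::io::Result<()> {
    let (sender, receiver) = crossbeam_channel::bounded::<String>(1_000);

    // Each worker owns a clone of the receiver (crossbeam channels are
    // multi-consumer) and parses the lines it receives.
    let workers: Vec<_> = (0..4)
        .map(|_| {
            let receiver = receiver.clone();
            thread::spawn(move || {
                for line in receiver {
                    let _value: Value = serde_json::from_str(&line).expect("invalid JSON");
                    // ...
                }
            })
        })
        .collect();

    // reader.lines() allocates a new String for each line, which can be moved
    // to a worker, unlike the reused buffer in the read_until example above.
    let reader = BufReader::new(File::open(path)?);
    for line in reader.lines() {
        sender.send(line?).expect("all workers exited");
    }
    drop(sender); // close the channel so the workers' loops end

    for worker in workers {
        worker.join().expect("worker panicked");
    }
    Ok(())
}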

Have another look through statistics crates

Using statrs currently (most popular). There might be some newer crates that meet our needs better.

medians has medinfof64. It's a single-author library (along with rstats) and is less widely used.

qsv-stats

qsv-stats performs a sort - O(n log n) - to calculate quartiles. statrs uses a selection algorithm – O(n).

For DR bid ratios, numpy calculates 0.25580327. qsv-stats got 0.2560257847899094 (0.00022 diff). statrs got 0.2559516146277174 (0.00014 diff). In other words, no major difference.


Also looking at https://docs.rs/watermill/latest/watermill/ for online statistics.

ADR: watermill's quartile calculation is non-deterministic. I think that means we should not use that feature, as I expect it will be confusing to users to get different results (more or fewer flags) on different runs.

prepare: Consider changing stdout and stderr to mandatory options

e.g. --output (-o) and --error (-e).

Presently, if a user doesn't use redirection (e.g. > prepared.json 2> issues.csv), then they get a mix of both in the console output.

Also, I thought Windows users might have some challenges around redirection, but it looks okay, actually.

For implementation, we'll probably want an intermediate buffer that is then written to the output file at the end of each thread. Otherwise, output from different threads could be interleaved.
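A sketch of that buffering approach (the file name, thread count and written content are placeholders; in practice the file would be the one named by --output):

use std::fs::File;
use std::io::Write;
use std::sync::{Arc, Mutex};
use std::thread;

fn main() -> std::io::Result<()> {
    // Shared handle to the output file.
    let output = Arc::new(Mutex::new(File::create("prepared.json")?));

    let handles: Vec<_> = (0..4)
        .map(|worker| {
            let output = Arc::clone(&output);
            thread::spawn(move || {
                // Accumulate this thread's output in a private buffer...
                let mut buffer = Vec::new();
                writeln!(buffer, "{{\"worker\": {worker}}}").unwrap();
                // ...and write it in one call at the end of the thread, so
                // output from different threads is never interleaved mid-line.
                output.lock().unwrap().write_all(&buffer).unwrap();
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
    Ok(())
}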

Fix Homebrew release

Add the following to the bottom of release.yml (after addressing the TODO):

  bottle:
    needs: release
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        # macos-13 is not available. https://github.com/actions/runner-images/issues/6426
        include:
          - name: arm64_monterey
            os: macos-12
            target: aarch64-apple-darwin
          - name: arm64_big_sur
            os: macos-11
            target: aarch64-apple-darwin
          - name: monterey
            os: macos-12
            target: x86_64-apple-darwin
          - name: big_sur
            os: macos-11
            target: x86_64-apple-darwin
          - name: catalina
            os: macos-10.15
            target: x86_64-apple-darwin
          - name: x86_64_linux
            os: ubuntu-latest
            target: x86_64-unknown-linux-gnu
    steps:
      - id: setup-homebrew
        uses: Homebrew/actions/setup-homebrew@master
      # TODO need to update the url and sha256, otherwise brew install complains
      # try with checkout and git commands
      # https://lannonbr.com/blog/2019-12-09-git-commit-in-actions/
      - run: brew tap open-contracting/tap
      - env:
          CARGO_TARGET: ${{ matrix.target }}
        run: brew install --build-bottle --verbose ocdscardinal
      - run: brew bottle --no-rebuild --verbose ocdscardinal
      - env:
          GH_TOKEN: ${{ github.token }}
        run: gh release upload ${{ github.ref_name }} ocdscardinal--${{ github.ref_name }}.${{ matrix.name }}.bottle.tar.gz

brew install ... errors with:

==> Verifying checksum for '7c4169757f272594850bc4fae4e03439505521618da28851cbe1ae7226e5dd96--cardinal-rs-0.1.0.tar.gz'
Error: ocdscardinal: SHA256 mismatch
Expected: 8408aea9b1f47369e07697c4bd2411179e18fa5c1e9fe5b79b9f2ff1dd712323
  Actual: 7881c01b85fe3088643faf757d90d2267b52aaafddbe5a45bab464b6d42430fd
    File: /home/runner/.cache/Homebrew/downloads/7c4169757f272594850bc4fae4e03439505521618da28851cbe1ae7226e5dd96--cardinal-rs-0.1.0.tar.gz
To retry an incomplete download, remove the file above.

See also https://github.com/marcprux/update-homebrew-formula-action

When I last looked at this, I was also looking through org:fair-ground HOMEBREW_FAIRTOOL_ARCH on GitHub Search.

Need to also fix GitHub Actions on https://github.com/open-contracting/homebrew-tap/actions

prepare: Add currency conversion

Convert amounts if there are multiple currencies.

Use same approach as in pelican-backend: https://pelican-backend.readthedocs.io/en/latest/api/util/currency_converter.html

In Pelican, we convert all values to USD. This means there will be very many conversions, even if 99% of the dataset is in another currency. However, I think supporting conversion to any currency requires more API calls to fixer.io (assuming it has rates between all currency pairs – I think conversion to USD has the best coverage).

Access to conversion rates is not free, in general. This feature would need to be opt-in, with the user supplying a fixer.io API token via the configuration file. (We can consider other sources, but I think fixer is pretty good.)

Amounts are compared in the fold step, so we already need to know by that point whether conversion is required. As such, the tool will need to be instructed (via configuration) to perform conversion from the start.

The default behavior can be to warn about multiple currencies, and otherwise ignore other currencies.
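A sketch of the conversion step itself, assuming a pre-fetched table of rates expressed as units of each currency per 1 USD (fetching rates from fixer.io is not shown, and the function shape is illustrative):

use std::collections::HashMap;

// Returns None if the currency is unknown, so the caller can warn instead.
fn to_usd(amount: f64, currency: &str, rates_per_usd: &HashMap<String, f64>) -> Option<f64> {
    if currency == "USD" {
        return Some(amount);
    }
    rates_per_usd.get(currency).map(|rate| amount / rate)
}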

Use configuration file to opt-in to each indicator

The default (and template) configuration file can contain all indicators, along with in-line documentation about their options.

This will also make it easier to isolate tests (presently, testing one indicator might return results for another).

Apple code signing

R036: Add logic for awardCriteria

Right now, the indicator entirely ignores the awardCriteria.

We could add an option to the prepare command to set awardCriteria to 'priceOnly' if the lowest valid bid is awarded.
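A sketch of that inference, simplified to amounts in a single currency with exact equality (both simplifications are assumptions):

// If the awarded amount equals the lowest valid bid amount, prepare could
// set tender.awardCriteria to "priceOnly".
fn looks_price_only(valid_bid_amounts: &[f64], awarded_amount: f64) -> bool {
    let lowest = valid_bid_amounts.iter().copied().fold(f64::INFINITY, f64::min);
    lowest.is_finite() && lowest == awarded_amount
}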

Add to test.rs for indicators and prepare commands

prepare

  • stringify IDs
  • defaults
  • redactions: amount
  • redactions: organization_id
  • codelists: BidStatus
  • codelists: AwardStatus
  • invalid JSON
  • non-object JSON
  • errors CSV file output

indicators

  • no_price_comparison_procurement_methods / price_comparison_procurement_methods (R028, R036, R024/R058) 6f57bd2 73245c5
  • maps
  • mixed currencies warning (R024/R058)
  • global exclusions (is_cancelled_contracting_process)

prepare: Options for datasets without bids (e.g. R025)

For R025 (Excessive unsuccessful bids), we can consider filling in bids/details according to tender/tenderers, assuming that the status is 'valid' and that each tenderer submitted a separate bid.

From discussion:

James: I'm not sure how this indicator should be modified if only tender/tenderers is available

Camila: I agree that the best option is to have bids information, where the bids are not disqualified or withdrawn. We should prioritize this and recommend the use of this field. However, with tenderers/id, you could still calculate the success rate, with the limitation that you could be counting disqualified or withdrawn bids. We could highlight that, or maybe just calculate it with the bids fields, and in the methodology we could mention this alternative to users.

R044: More robust address matching

For example, dedupe (as I remember) applies address normalization (for at least US addresses). If we follow the same approach, we'd need to implement appropriate normalization for different jurisdictions. This strategy uses equality tests, but allows for some address components to be missing (e.g. "Main" vs "Main St"). I know Roberto Rocha recently evaluated a few different strategies when merging Canadian political donation datasets.

I think naive fuzzy matching will yield too many false positives (e.g. 1 Main St, Podunk, New York, USA 12345 and 100 Main St, ... are very close typographically, but are not at all the same address).

The first implementation could just do simple equality.

The metadata for this indicator should include a measure of similarity (percentage or otherwise).
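A sketch of the simple-equality baseline with light normalization (the normalization rules are illustrative and not jurisdiction-aware):

// Normalize an address for exact comparison: lowercase, drop ASCII
// punctuation, and collapse whitespace, so "1 Main St." equals "1  main st"
// but "1 Main St" and "100 Main St" remain different.
fn normalize_address(address: &str) -> String {
    address
        .to_lowercase()
        .chars()
        .filter(|c| !c.is_ascii_punctuation())
        .collect::<String>()
        .split_whitespace()
        .collect::<Vec<_>>()
        .join(" ")
}

fn same_address(a: &str, b: &str) -> bool {
    normalize_address(a) == normalize_address(b)
}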

Contributor documentation

Principles

  • Results should be stable. It's okay for an update to a contracting process to cause a red flag to be newly raised. However, it is disfavored for an update to cause a red flag to be lowered. For example, while awards are pending, calculating some red flags can cause false positives; on the other hand, it's okay (and normal) for a flag to be raised after an update.

  • Keep data preparation separate from indicator calculation. (see comment in #23)


  • From comments on the user research report: "It should be possible for developers to read documentation on how to implement new red flags, with as little new code as possible."

New command: prepare

Following the principle of "Keep data preparation separate from indicator calculation" (#29), we can add a command to do:

and maybe these as optional (opt-in) pre-processing steps:

  • #32 change id references to identifier/id (perhaps if consistently available)

Pretty much all issues labeled 'robustness' could be resolved via this command.

Also:

  • Lowercase codes. For now, we'll require users to manually map such codes. If it's a common issue, we can add an option that lowercases all codelist fields used in indicators.
