
org's Introduction

Code4rena Org

Code4rena Contest Rulebook

Code4rena is an interdependent community of people doing highly subjective work in a greenfield industry.

In order to facilitate consensus building, this repository contains the mechanism, rules, and best current practices of Code4rena contests:

  • Best Current Practices are community-driven discussions intended to provide guidance for sponsors, judges, and wardens. You can find discussions of best current practices in open rules issues.
  • Mechanism represents the fundamentals of how audit contests work and how awards are distributed. (currently minimal)
  • Rules support the mechanism with nuance applied in subjective scenarios as interpreted by judges (NOT STARTED YET)

Contributing to the rulebook

Pasted from this chat thread introducing the repo, to provide some context in the absence of any documentation for this repo yet :)

+1 to it being super helpful for wardens to help identify which things from past contests seem like they may be outliers.

Can we ask everyone to file issues on this repo for each case that stands out as a deviation or a place where the rule seems inconsistently applied?

We intend to develop this repo as the single source of truth for which things are hard-and-fast rules vs. more subjective and still developing.

One absolute requirement: let’s make sure issues are filed as neutrally worded inquiries rather than letting this turn into a public complaint forum about how your own submissions were judged. (There is no code of conduct in the repo yet, but there will be when we get a chance.)

Tip: If you feel strong emotion as you’re filing an issue, it would be wise to take a step back for a bit and then come back later and try to submit your issue as a neutral scientific examination. Creating this rulebook will be a collaborative effort between all members of the community, and eventually we will have clear processes created and documented for how changes are proposed and made.

Right now we’re bootstrapping it, so please feel free to contribute however you see best and we’ll start to sort the chaos out as a community. The structure is not set in stone either; do feel free to make suggestions for how it could be improved.


org's Issues

Severity standardization - event related impacts

It would be beneficial for all of us to standardize severities for common issues.
This discussion concerns incorrect emission of events: for example, events emitted with wrong parameters, flows where an event is emitted incorrectly or not emitted when it should be, etc.

Typically reports of that sort are graded QA. Examples of past severities:
code-423n4/2022-10-holograph-findings#270 - M
code-423n4/2022-11-redactedcartel-findings#334 - QA
code-423n4/2022-10-juicebox-findings#80 - QA

In my opinion, unless the warden can show clear reasoning why the event is important for the functionality of some pre-discussed off-chain setup (for example, an auction will not be displayed in the UI), it should be marked as QA.
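To make this class of finding concrete, a minimal hypothetical sketch (the contract, event, and bug are illustrative and mine, not from any past contest):

// SPDX-License-Identifier: MIT
pragma solidity ^0.8.0;

contract AuctionHouse {
    event BidPlaced(address indexed bidder, uint256 amount);

    mapping(address => uint256) public bids;

    function placeBid() external payable {
        bids[msg.sender] += msg.value;
        // Bug: emits the bidder's running total instead of the incremental bid,
        // so an off-chain indexer that sums BidPlaced amounts over-counts.
        emit BidPlaced(msg.sender, bids[msg.sender]);
    }
}

On-chain accounting is unaffected here; only consumers of the event are misled, which is why such reports usually end up in QA unless a concrete off-chain dependency (such as the UI example above) is demonstrated.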

Handling QA/Gas issues upgraded to Medium/High

Context: Since instating the mechanism change around QA and Gas reports, it's become apparent that there's an open question about how to handle changes in severity, when they cross over from QA/Gas to Medium/High, or vice versa.

Handling QA/Gas issues upgraded to Medium/High

These are simpler than downgraded Med/High submissions in a sense, because the awarding mechanism is straightforward. However, it raises a question for me: if a warden can't figure out why or how a vuln could become Medium/High risk, have they really earned a Medium/High payout? This one just feels nuanced, so I would love to hear from judges as well as wardens.

Relatedly: at the moment, this scenario requires significant additional effort from judges and contest admins, because we've asked them to copy/paste the relevant issue from the QA/Gas report into a new issue in Github, then add it to the findings sheet, and there's extra effort required to ensure that our tooling "catches" the new issues for award calculation (because the issues were not submitted via the submission form and attached to the warden's handle).

Obviously we'll be improving our tooling to handle these needs, but I think there's also a need to establish consensus around what's fair and appropriate.

(See also issue #9 , which is related.)

Inconsistent severity for two-step change in governance

code-423n4/2021-12-pooltogether-findings#106
The following issue was marked as high severity.
As mentioned by judge leastwood, the issue of a two-step change of governance is usually treated as non-critical.
Both issues are about a missing layer of protection against human mistakes that could result in dire consequences,
so there seems to be a disparity in the severity.

Another similar issue that was marked high:
code-423n4/2021-12-pooltogether-findings#8

conversation in the chat channel:
danb — 02/05/2022
code-423n4/2021-12-pooltogether-findings#106
it seems very weird to me that it was marked as high risk

hickuphh3 — 02/05/2022
Could u explain why u think it's weird? There doesn't seem to be a way to recover funds if epochDuration was set to 0. Definitely high / critical impact. One can argue that the likelihood of such an error is low, thus reducing the overall severity to medium perhaps.

danb — 02/05/2022
Correct, it's just a missing sanity check
danb — 02/05/2022
in my opinion it should have the same severity as two step governance change

hickuphh3 — 02/05/2022
Ok, I see where you’re coming from. Both have similar mechanisms where an assumed incorrect input would lead to the loss of functionality / funds lockup.



What makes this case more severe is accessibility, where anyone can call the createPromotion() function as opposed to a governance setter that’s restricted.

Also, I’d argue that the loss of functionality from a wrong address set is less severe than fund lockup (although it could encompass it if governance is in control of funds) because there is the possibility of migrating to a new set of contracts, but there is no remedy for the latter.

Nevertheless, there does seem to be a disparity in severity if two-step governance is to be considered non-critical.
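For reference, the PoolTogether-style finding boils down to a missing sanity check on a caller-supplied duration. A rough hypothetical sketch (names and logic are simplified by me, not the actual PoolTogether code):

// SPDX-License-Identifier: MIT
pragma solidity ^0.8.0;

interface IERC20 {
    function transferFrom(address from, address to, uint256 amount) external returns (bool);
}

contract Promotions {
    struct Promotion {
        address creator;
        IERC20 token;
        uint256 tokensPerEpoch;
        uint48 epochDuration;
        uint8 numberOfEpochs;
    }

    Promotion[] public promotions;

    // Anyone can call this, and epochDuration == 0 is never rejected.
    function createPromotion(
        IERC20 token,
        uint256 tokensPerEpoch,
        uint48 epochDuration,
        uint8 numberOfEpochs
    ) external {
        token.transferFrom(msg.sender, address(this), tokensPerEpoch * numberOfEpochs);
        promotions.push(Promotion(msg.sender, token, tokensPerEpoch, epochDuration, numberOfEpochs));
        // If epochDuration is 0, downstream epoch/reward bookkeeping that depends on it
        // can leave the deposited tokens permanently unclaimable: funds locked by a
        // single bad input, which is the crux of the high-vs-non-critical debate above.
    }
}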

Severity standardization - speculation on future code

It would be beneficial for all of us to standardize severities for common issues, as we have seen variance among judges and contests.

Speculation on future code is a reason often used (I've seen Alex use it frequently) to close submissions as out of scope. I agree that in many cases this is a valid reason, for example when a future upgrade could make some part of the code not work, but it's unclear if that will ever happen.

There are cases, though, where "speculation of future code" is an invalid way to close issues.

  • Contracts which are meant to be inherited from, or composed with, that have unexpected behavior (breaking invariants). Even if no example of such use is in the project codebase.
  • In centralization risk report, as a way to reject rugging vectors.

Would be happy to hear more commentary and other cases where this reason doesn't hold, and when it does.

Difference in judging

Hello, after a little discussion with @Picodes, he mentioned creating a thread here to get some clarity on what severity a certain type of finding should receive.

There is an attack vector related to NFTs that hold value beyond the NFT itself (lotto tickets, LP positions).

In Wenwin, issue 336 was deemed a Medium, which made sense to me. The idea was that the owner of a lotto ticket could sell the ticket on a secondary market and then front-run the buyer and redeem the ticket. This takes the winnings of the ticket and the buyer's money, leaving the buyer with a worthless NFT. The solution here was to burn the NFT after redeeming the winnings.

In issue 196 of Ajni I submitted a similar issue. Here an owner can sell the NFT, which represents an LP position, on a secondary market. When the buyer buys the NFT, the owner can front-run that transaction and redeem the NFT for all of the LP position (the only option is 100% redemption). This takes all the LP value from the NFT, and the buyer's money, leaving the buyer with a worthless NFT. The solution here was also to burn the NFT on redemption. Ajni already has the burn functionality in their contract; they would just need to put it in the redeem function.

To me these issues are extremely similar, and it makes sense for the severity to be a Medium. Without the burn functionality in redeem, any user could do this, steal other users' funds, and eventually make the secondary markets unusable for these types of NFTs.

Any clarity on how this issue should be judged in Ajni and in contests moving forward would be greatly appreciated.
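A minimal hypothetical sketch of the shared pattern (illustrative only; this is not the Wenwin or Ajni code, and it assumes the OpenZeppelin ERC721 implementation is available):

// SPDX-License-Identifier: MIT
pragma solidity ^0.8.0;

import "@openzeppelin/contracts/token/ERC721/ERC721.sol";

contract ValueBearingNFT is ERC721("Ticket", "TKT") {
    mapping(uint256 => uint256) public underlyingValue; // winnings or LP share backing the token

    function mint(uint256 tokenId) external payable {
        underlyingValue[tokenId] = msg.value;
        _safeMint(msg.sender, tokenId);
    }

    function redeem(uint256 tokenId) external {
        require(ownerOf(tokenId) == msg.sender, "not owner");
        uint256 amount = underlyingValue[tokenId];
        underlyingValue[tokenId] = 0;
        // Vulnerable version: the NFT is NOT burned, so a seller can list it on a
        // secondary market, watch for the buyer's purchase in the mempool, front-run
        // it with redeem(), and still deliver a now-worthless token.
        // Fix discussed in both contests: _burn(tokenId) before paying out.
        payable(msg.sender).transfer(amount);
    }

    receive() external payable {}
}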

Severity standardization - Conditional on user mistake

It would be beneficial for all of us to standardize severities for common issues, as we have seen variance among judges and contests.

Example of two different severities:
Frax Liquidity contest - using unsafe transfer() to sweep mistakenly sent tokens means unsupported tokens like USDT are stuck in the contract. Judged M.
Blur contest - theoretical leak of value: if the seller supplies ETH to the execute() operation, they will pay instead of the buyer.
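For context, the Frax-style sweep case looks roughly like this (a hypothetical sketch of the pattern, not the contest code):

// SPDX-License-Identifier: MIT
pragma solidity ^0.8.0;

interface IERC20 {
    function transfer(address to, uint256 amount) external returns (bool);
    function balanceOf(address account) external view returns (uint256);
}

contract Sweeper {
    address public immutable owner;

    constructor() {
        owner = msg.sender;
    }

    // Recovers tokens a user sent here by mistake.
    // Because IERC20.transfer is declared as returning bool, tokens such as USDT
    // (whose transfer returns no data) make this call revert during return-data
    // decoding, so the mistakenly sent tokens stay stuck in the contract.
    // A SafeERC20-style wrapper avoids this.
    function sweep(IERC20 token) external {
        require(msg.sender == owner, "only owner");
        require(token.transfer(owner, token.balanceOf(address(this))), "transfer failed");
    }
}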

We need to have a fixed policy on how requiring user mistake reduces severity level.
One simple option is: if the API is used in a way that is not sanctioned by the protocol, the max severity is Low.
Another option is - if it is deemed likely by the judge that an API will be used in some particular, wrong way, and there is a lack of check for this wrong usage which may lead to user losses, it could be accepted as a Medium.

I lean towards the first option as it levels the playing field, makes it easier for judges, and cannot lead to very subjective disputes. The downside is that sponsors will get fewer reports about safety checks they should add; it is unclear whether these would find their way into QA reports.

Opinions are welcome.
@GalloDaSballo

Proposition for collection and processing of judge performance

C4 has accumulated a healthy number of judges in its lineup. Alongside the positives that come from diversity, we are increasingly susceptible to a gap in quality between different judges. While most judges are excellent (they attempt to fully understand raised issues, are open to feedback, are nuanced, and make a best effort to finish judging quickly), there are some that lack in one department or another.

I believe the lack of feedback on judge performance is detrimental to the self-improvement of the judge and the organization and ultimately to warden motivation, which is a tremendous resource in the C4 force.

For these reasons, I propose a per-contest feedback collection mechanism, from both wardens and sponsors. Feedback will ideally include a score on:

  1. Judging completion time
  2. Balance and nuance of judging, ability to relate to raised issues in-depth.
  3. Neutrality of judging (have two almost identical reports been rated differently?)
  4. Qualitative analysis of post-judging QA (was judge responsive to feedback, were there too many errors in the initial judging?)

Obviously, it is not the intention that wardens "penalize" judges who did not accept their submissions, all feedback needs to be well constructed and objective in order to count.

Would love to hear your thoughts on this proposal, which I believe will help increase transparency and accountability in a crucial part of the C4 pipeline.

How to handle “post-game” judge re-assessments

Problem

For contests that were judged prior to our instatement of a 48-hour sponsor review period (during which sponsors run a QA check on the judge’s assessments), we’ve had several cases where the sponsor and judge revisit issues after awards have been calculated, announced, and added to the leaderboard, and either downgrade or upgrade severity.

While we have instituted some process changes that should reduce the frequency with which this happens, there's always some risk of "blown calls," where new context comes to light after the judging period has ended, and in the spirit of fairness, there is a desire to re-assess decisions (missed duplicates, flubbed severity ratings, etc.).

How we’ve handled it so far

In an attempt to a) accommodate sponsors & judges’ desires, and b) ensure our reports are as accurate as possible, I (Ellie) have taken the following approach to date:

  • Update findings sheet;
  • Re-run award calculations;
  • Determine which wardens were owed more than they were originally paid out;
  • Send additional payouts to those wardens;
  • Update the leaderboard with the corrected awardcalc results.

Upsides of past approach

  • Leaderboard more accurately reflects wardens’ numbers of High & Medium issues found;
  • Any wardens who were owed more money hopefully gain a sense that C4 rewards people fairly.

Downsides of past approach

This is a painstaking, laborious process and has some negative effects:

  • Costs C4 money in additional warden payouts;
  • Increases the discrepancies between actual payouts vs what’s displayed on the leaderboard;
  • Makes some wardens unhappy, because their leaderboard numbers can go down (e.g. downgraded issues, and/or the awardcalc re-run results in lower awards shown on leaderboard).

Questions

  • What might a better solution to the above-stated problem look like?
  • What should our policy be, in cases where a judge and sponsor agree that an issue ought to be re-assessed “after the fact”?

Proposal

I propose that we take the following approach (and establish a policy) for contests where this arises, going forward:

  1. Once awards have been announced, we will make no further adjustments to warden payouts or leaderboard scoring.
  2. We may adjust the risk levels of issues included in the report, but will include a note to indicate that the issue(s) in question were re-assessed by the judge and/or sponsor after the judge’s decisions were announced.
  3. In extraordinary cases, where a warden’s submission was unfairly excluded from duly-owed awards:
    1. Judges can propose a modest ARENA token grant (say, 100% of the award difference, up to a max of $5k) to make up for blown calls (measured in USD value, not ARENA count), and we batch these together periodically when we do airdrops.
    2. However, C4 will NOT update the leaderboard scoring.

Loss of Yield as High

I am flagging this judging decision as a way to avoid it being used again by other wardens in the future.
code-423n4/2023-02-ethos-findings#482

The finding was judged as High with the following comment:

Personally view loss of yield as high severity (in line with Immunefi classification).

This is, in my opinion, an incorrect explanation: the contest should be judged by the rules of Code4rena, since those are the rules all wardens have opted into; external rules and classifications should not be a valid rationale for determining severity.

By the same logic, all Consensys findings flagged as Medium would have to be accepted, even though Medium severities are not even exploits by their own classification.

There are hundreds of reports that have been classified as Med via the "leak of value" interpretation of the rules; in my opinion, this one should have been judged in the same way.

Interested in more Judges thoughts

@trust1995 @dmvt @0xean @0xleastwood

More detailed description of overall C4 philosophy in docs

I think I've gained a really good insight into what that is by interacting on the Discord channels, but perhaps it's time to clarify it more in the documentation. In particular, I think the recent discussion in the Discord channels around what should be awarded/not-awarded is a key part of the philosophy that could do with more clarification in the docs. I also think it would serve a dual role as "expectation management" for wardens, which would have the benefit of reducing disputes about payouts.

Severity standardization - Lack of two-step ownership transfers / lack of zero address check

It would be beneficial for all of us to standardize severities for common issues, as we have seen variance among judges and contests.

Example of two different severities:
NounsDAO - accepted as M finding
VTVL - reduced to QA

I believe the right call is QA for lack of either or both of the two-step / zero-address checks. They require explicit user negligence. Also, they are very easy to spot, so they will definitely end up being mentioned in QA reports; we don't need them spammed in M/H discussions.

@GalloDaSballo

EDIT: The NounsDAO was a poor example, please refer to this example:
Vader contest - lack of 2 step change of DAO's address, accepted as M
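For concreteness, the two variants of the setter being debated, as a minimal hypothetical sketch:

// SPDX-License-Identifier: MIT
pragma solidity ^0.8.0;

contract Owned {
    address public owner = msg.sender;
    address public pendingOwner;

    // Single-step version: a typo or the zero address bricks ownership forever.
    function setOwnerUnsafe(address newOwner) external {
        require(msg.sender == owner, "only owner");
        owner = newOwner;
    }

    // Two-step version with a zero-address check: a mistake has to survive
    // both the proposal and an acceptance sent from the new address.
    function proposeOwner(address newOwner) external {
        require(msg.sender == owner, "only owner");
        require(newOwner != address(0), "zero address");
        pendingOwner = newOwner;
    }

    function acceptOwnership() external {
        require(msg.sender == pendingOwner, "not pending owner");
        owner = pendingOwner;
        pendingOwner = address(0);
    }
}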

Severity standardization - NFT metadata

We don't seem to have a standard severity for NFT metadata issues, so I created this issue to discuss it.
There are two main problems:

  1. The hard-coded Metadata JSON data can be bypassed and overwritten
  2. A script can be inserted into the SVG data and executed to run arbitrary commands

The focus is on whether this issue should be filtered by the front end or the smart contract.
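As a hypothetical illustration of the second problem (all names here are made up), an on-chain metadata builder that interpolates unsanitized user input into SVG lets that input escape the markup on any front end that renders it as HTML:

// SPDX-License-Identifier: MIT
pragma solidity ^0.8.0;

contract OnChainMetadata {
    mapping(uint256 => string) public labels;

    // User-controlled string, stored with no sanitization or length limit.
    function setLabel(uint256 tokenId, string calldata label) external {
        labels[tokenId] = label;
    }

    // The label is concatenated straight into the SVG. A value such as
    // '</text><script>...</script>' escapes the element and can execute in any
    // viewer that renders the SVG as HTML, and can also break the JSON wrapper.
    function tokenSVG(uint256 tokenId) external view returns (string memory) {
        return string(abi.encodePacked(
            "<svg xmlns='http://www.w3.org/2000/svg'><text>",
            labels[tokenId],
            "</text></svg>"
        ));
    }
}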

Welcome to discuss and supplement the case.

How are Gas/QA reports ranked?

Now that a few awards with the new methodology have been released, it would be good to hear how the judges approached the ranking of the consolidated reports. The current guidance says that they're graded on a scale but aside from that there are very few details. Here are a couple of specific questions I had:

  • Are judges ranking gas reports based on the approximate total gas saved, or on the number of classes of gas findings? The same sort of question applies to QA reports (number of low/borderline-medium issues, or total findings).
  • If a finding can fall into both reports if worded differently, is one excluded or are they counted separately?
  • Have there been any commonalities in what caused the top three reports in each category to place the way they did?
  • It seems like issues that were downgraded from medium tend to bump up the QA report's ranking - is that the case? How do you decide how much more valuable a downgraded finding is than the sum of the other findings submitted separately?

For the excluded/separately question I would think that sponsors would want to hear both reasons for making the change, so I think having the differently-worded issue in both would be best. What do others think? What has been the approach taken so far for these questions? Any other questions?

Discrepancy in Awarding Formula

C4 awards wardens by giving shares for discovered bugs. These shares give the owner a pro rata piece of the pot. Exact details can be found here.

The awarding formula is this:

Med Risk Shares:   3 * (0.9 ^ (findingCount - 1)) / findingCount
High Risk Shares: 10 * (0.9 ^ (findingCount - 1)) / findingCount

Besides this, the best report also receives a 30% bonus as detailed here.

These two conditions can be gamed by wardens to receive more than ideally intended rewards.

Consider these scenarios in which a warden finds a valid solo high severity bug.

Scenario 1 - Ideal Case

The warden submits a single report; their high risk shares, calculated according to the above formula, come out to 10. Hence 10 shares are allotted to the warden.

Scenario 2 - Gamed Case

The warden submits two reports of the same bug using two different C4 accounts. According to the formula, each account should get 4.5 shares. But since one of those two reported issues will be selected as the primary, an additional 30% bonus, i.e. 1.35 shares, will be added to that account.

So this will result in Account1 getting 4.5 + 1.35 = 5.85 shares and Account2 getting 4.5 shares.

Essentially the warden got 5.85 + 4.5 = 10.35 shares instead of 10 shares.

It should be noted that this gaming path for shares only exists when findingCount is equal to 2; beyond that, the shares reduce significantly.
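Generalizing the scenario, for a warden splitting a solo finding of base shares S across n of their own accounts (this back-of-the-envelope check is mine, not from the C4 docs):

totalShares(n) = n * [S * (0.9 ^ (n - 1)) / n] + 0.3 * [S * (0.9 ^ (n - 1)) / n]
               = S * (0.9 ^ (n - 1)) * (1 + 0.3 / n)

n = 1:  S                             (honest baseline)
n = 2:  S * 0.90 * 1.15 = 1.035 * S   (the 3.5% gain described above)
n = 3:  S * 0.81 * 1.10 = 0.891 * S   (already below the baseline)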

Penalty / Award Standardization - Duplicate Report PoC Thoroughness

Hello everyone, I would like to discuss today an issue w/ regard to penalty / award standardization for duplicate reports.

Description

Currently, there is no strict guideline in the C4 documentation concerning how a duplicate finding is meant to be penalized or how a finding is selected for the report. The following excerpt is the only guideline:

However, any submissions which do not identify or effectively rationalize the top identified severity case may be judged as “partial credit” and may have their shares in that finding’s pie divided by 2 or 4 at a judge’s sole discretion (e.g. 50% or 25% of the shares of a satisfactory submission in the duplicate set).

Argument

The point I would like to discuss is the necessity of a PoC (Proof-of-Concept) containing a code example. I believe that a Proof-of-Concept can be textual, not contain any exemplary code of how to exploit, and still be considered a valid finding. In the auditing industry, it is pretty standard for reports to contain purely textual descriptions of how a vulnerability can be exploited (based on personal experience as well as publicly available reports).

A thorough textual description of how someone can exploit a vulnerability should be considered equivalent to a code-only Proof-of-Concept that demonstrates the vulnerability in practice with code. The outputs of a report are, in most cases, compiled to a browsable web-page. This is performed to contribute to the ecosystem and allow users to look up issues that have historically occurred and not repeat them.

By incentivizing code-related Proofs-of-Concept submissions rather than textual descriptions, we end up with PoCs that are specific to the project rather than specific to the vulnerability. Ensuring both types of findings are treated on equal grounds is, in my opinion, the ideal approach for the benefit of both the projects and the ecosystem.

Examples Requested

We welcome any warden and judge alike to provide us with a tangible example of a judgement discrepancy arising from PoCs.

As things stand, the most observable pattern is that submissions with a PoC are selected for the report over ones without one, which does affect the topic discussed by this issue.

Parting Remarks

I would be glad to hear some judges' feedback on this and it would be good to clarify how penalization as well as selection-for-report works in the Code4rena documentation.

Severity standardization - Centralization risks

It would be beneficial for all of us to standardize severities for common issues, as we have seen variance among judges and contests.

Example of two different severities:
Frax Liquidity contest - Admin can freely set important parameters and add minters. Judged as M
Rubicon contest - admin can change important parameters - downgraded to QA.

We need to have a fixed policy on how to judge centralization risks.
One simple option is - max severity for any privileged function abuse is Low
The other extreme - lack of timelocked admin contract for any meaningful change to the protocol could be Medium risk
Obviously there are various middle-of-the-road approaches.

Personally I think that centralization risks CAN be considered M findings, provided the report shows a unique attack path which users were not told about upfront. For example, in VTVL, the admin can revoke a user's staked tokens after the staking period. Since this is not something users would see in advance, and it impacts assets they are entitled to, it does warrant a major severity level (a sketch of this kind of rug vector is below).
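A hypothetical sketch of that kind of rug vector (simplified by me; this is not the VTVL code):

// SPDX-License-Identifier: MIT
pragma solidity ^0.8.0;

contract Vesting {
    address public admin = msg.sender;
    mapping(address => uint256) public staked;

    function stake() external payable {
        staked[msg.sender] += msg.value;
    }

    // Centralization risk: the admin can claw back a user's stake at any time,
    // including after the staking period, with no timelock and nothing in the
    // docs warning users about it.
    function revoke(address user) external {
        require(msg.sender == admin, "only admin");
        uint256 amount = staked[user];
        staked[user] = 0;
        payable(admin).transfer(amount);
    }
}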

Opinions are welcome.
@GalloDaSballo

request: Details of judges

In the interest of transparency, publish details of judges and the expertise legitimizing their judgement for awards.

Similar exploits under a single issue

There are cases where there are multiple instances of an exploit within a single contest, and it's not clear whether they should be filed as separate submissions or as one submission with multiple sub-issues. Disregarding whether the findings are valid or not, these are four examples where a single category of exploit was submitted by the warden multiple times and each submission was awarded separately. Recently in the Code4rena Discord chat, wardens were recommended to file such instances as a single submission with multiple sub-issues, but based on the Timeswap findings this was called into question. It's not clear whether doing so will be awarded the same way as if a warden files separate issues, and it's not clear how such combined submissions will be identified in the final report, given that currently each separate submission is given an issue number.

I am proposing that wardens SHOULD combine similar issues into one submission with multiple sub-issues, and that judges MUST award each instance separately, and that the final report MUST indicate separate issue numbers for each instance (e.g. [H-01,02,03]). Each separate instance MUST be clearly identified by the warden in the submission; the judge must not have to interpret the warden's intent.

Adding 75% credit as an option

Right now we (judges) can choose to give 25%, 50%, or 100% credit to an issue. I'd like to see a 75% option added. There have been multiple issues in the last few contests where 100% feels like too much, but 50% feels too harsh.

Handling High/Med risk issues downgraded to Low/Non-crit/Gas

Since instating the mechanism change around QA and Gas reports, it's become apparent that there's an open question about how to handle changes in severity, when they cross over from QA/Gas to Medium/High, or vice versa.

I'm going to file these as separate issues (starting with the one I think is trickier), so we can keep discussion focused.

High/Med risk issue downgraded to Low/Non-crit/Gas

Should we handle these as:

  1. stand-alone QA/gas reports, that need to be scored on the 0-100 scale?
  2. addenda to a warden's existing QA/gas report (if one was submitted)? (Perhaps if the warden has not submitted QA/Gas report, the judge would score it as a stand-alone, so would likely receive a low score.)
  3. invalid issues, since they were submitted at the wrong risk level? (Pretty harsh, IMO.)
  4. old-style Low/Non-crit findings? (I think this one is likely a non-starter as it would seriously muddy the waters as far as incentive structure, and make award calculation unnecessarily complex.)

Problems with Lybra Finance contest judging

Note: I would like to emphasize that this issue isn't created with the intention to blame or criticise anybody (as I highly respect every Code4Rena member and I'm sure that you are doing a great job in general), but in order to potentially make judging process better in the future.

My complaints

Recently, the report for the Lybra Finance contest has shown up and I was able to review my submissions. I have several concerns about the judging of my findings and I would like to describe them here:

  1. First of all, two of my submissions reported the exact same errors as submissions that were accepted as Medium findings, but were rejected for some reason:
  • the issue code-423n4/2023-06-lybra-findings#44 was accepted as Medium, but my issue code-423n4/2023-06-lybra-findings#513 was downgraded to QA for some reason. It's clear that these issues refer to the exact same bug. Note that even my "Vulnerability details" section is the same as the "Impact" section of the selected finding. Moreover, my finding has a coded POC, which isn't present in the selected report.
  • the issue code-423n4/2023-06-lybra-findings#221 was accepted as Medium and was qualified as a SOLO finding. It is about implementing slippage protection in the rigidRedemption function. My issue code-423n4/2023-06-lybra-findings#534 was talking about exactly the same problem, and you don't even have to read the description because it's obvious from the title. However, it was classified as unsatisfactory because of "Insufficient quality" - I don't know why, since my POC is similar to the one in the accepted finding and I also wrote more in the Vulnerability Details section. Since the accepted finding was awarded a prize as a SOLO finding, it hurts even more, since I lost a lot of $.
  2. My second concern regards the judges' very vague argumentation about why they decided to downgrade or reject issues. One such example can be seen above, but I will give more of them below:
  • the following finding, code-423n4/2023-06-lybra-findings#659, was closed as "Out Of Scope" without any additional explanation. However, all the contracts that are affected and referenced in the issue (StakingRewardsV2 and ProtocolRewardsPool) are in scope, and this can be confirmed here: https://github.com/code-423n4/2023-06-lybra . If the judges had decided that this should be downgraded to QA, or had explained why the finding is out of scope, I would accept it. However, no such information was given and I don't even know what I did wrong (maybe it was in the known findings - writing "it's a known finding" would be sufficient).
  • another finding, code-423n4/2023-06-lybra-findings#655, was again classified as "Insufficient Quality" without any further explanation - I don't have any problem with it being downgraded to QA, but it was entirely rejected without any reason stated. Simply saying "Insufficient Quality" gives no information, as it can already be deduced that the quality was deemed insufficient from the fact that the issue was rejected.
  • another issue, code-423n4/2023-06-lybra-findings#530 - I spent a lot of time writing a runnable POC (the Lybra team didn't provide us with any tests, so we had to write them from scratch) and the description, and it was simply closed as "Insufficient quality". You can disagree with my arguments, you can disagree with the stated impact or issue classification, but you cannot say that this issue was low quality - low quality issues have two sentences in the description and no runnable POC - mine had both a detailed description and a runnable POC, and it's clear that I spent a lot of time on that one. Even if it deserved to be downgraded or rejected, it at least deserved a comment with the reason for closing it, instead of just "Insufficient quality".
  3. Finally, I think that not enough effort was made to understand reported issues. Here are two examples:
  • code-423n4/2023-06-lybra-findings#664 - this issue states that there is no input validation in several functions and that some variables can be mistakenly set to extreme values, and I then show where these values are used in the code. However, it was closed as "Insufficient proof". I don't know how I am expected to prove that a simple two-line function has no input validation - I provided links to the affected code and showed where the variables that can be mistakenly set are used. I was claiming no input validation and I showed the code of the affected functions - I think that is sufficient proof. Again, if it had been classified as QA, I wouldn't have any complaints; here, however, I'm claiming that not enough effort was made to understand the issue I presented.
  • in the following issue, code-423n4/2023-06-lybra-findings#798, I'm showing a griefing attack that may be used to steal gas from users: when they use PEUSD in rigidRedemption they may be forced to pay for a swap of EUSD (not PEUSD) on Curve, although they should only potentially pay for a PEUSD Curve swap. I even wrote it clearly: "Exchanging on Curve pool may be a very costly operation in terms of gas, so if it executes, user will have to pay high fee for it, although the operation has nothing to do with him.". However, the Judge's comment, "It is expected the distributeReward use gas fee", shows a misunderstanding of the issue, as I'm claiming that the user may potentially pay for a swap they shouldn't be involved in - it's as if I did a swap on Uniswap and also paid for someone else's swap. I will leave the second judge's comment, "Insufficient quality", without further comment.

Issues summary

I have highlighted 3 issues. Note that I don't claim that all my findings should be awarded Medium or High - I fully respect the Judges' decisions, and I probably wouldn't report some of these findings now, since I know that they are QA. I only claim 3 things:

  1. 2 of my findings were downgraded while identical findings that were submitted by other wardens were classified as Medium.
  2. Judges' explanations are very vague and give no feedback whatsoever to Wardens - if someone said "input validation on Code4rena is normally classified as QA, so your finding is downgraded to QA", it would be perfectly OK, it wouldn't cost Judges any time, and I would learn something, draw conclusions, and not make similar mistakes again. However, just saying "Insufficient quality", especially when it is clear that a Warden spent a lot of time on a submission, is just offensive to that Warden.
  3. Not enough effort, in my opinion, was made in this contest to understand Wardens' submissions. While I understand that Judges had a lot of issues to analyse, there weren't even any comments asking for additional explanation. Again, "Insufficient quality" isn't enough here.

My recommendations

  1. If you agree with me that 2 of my findings were unfairly closed / downgraded, I would like to ask you to update the report and my Medium findings count accordingly, so that they reflect that I found these vulnerabilities. I would also like to ask you to recalculate the ranking for the contest, so that these findings are reflected there. I am fully aware that I won't receive any money for these findings, so I will lose out either way, but at least the findings will be reflected in my Code4rena profile.
  2. I think judges should be required to give more precise explanations of why an issue was rejected - often one sentence is enough, as long as it's descriptive. "Insufficient quality" should only be used on reports that were clearly made without any effort. This will not take too much of the Judges' time, but it will give feedback to Wardens, so that they know which types of findings they should avoid submitting - in the long run, it will be beneficial for both sides, as fewer low-quality reports will be submitted.
  3. There was already a proposal (#95) several months ago to allow non-Backstage Wardens to comment on their own submissions, and I think it's a good idea. Additionally, when not enough effort was made by Judges to understand an issue, Wardens would still be able to defend their findings.

Severity standardization - Protocol does not support CryptoPunks

CryptoPunks do not conform to ERC721. Since the majority of NFT-centric protocols use the ERC721 interface to interact with NFTs, they do not support CryptoPunks without some additional wrapping. It is a very common report to see in NFT-related contests, and it has in the past mostly been judged as QA:

Nibble - QA
Blur - QA
PartyDAO - QA
Cally - QA
Foundation - MED

Note that in the Nibble discussion, the judge, who was the Foundation sponsor, remarked that it was too generous to ack as MED.

It seems like, by our severity guidelines, the issue is QA level. Punks can be wrapped externally and treated as ERC721. Also it sounds more like a feature request than something that is automatically assumed to be available.
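For context, the mismatch is roughly the following; the CryptoPunks signatures below are approximate reconstructions from memory, and the point is only that the contract predates ERC721 and exposes its own transfer/sale functions rather than transferFrom / safeTransferFrom:

// SPDX-License-Identifier: MIT
pragma solidity ^0.8.0;

// What most NFT protocols integrate against:
interface IERC721Minimal {
    function transferFrom(address from, address to, uint256 tokenId) external;
    function safeTransferFrom(address from, address to, uint256 tokenId) external;
}

// CryptoPunks' own pre-ERC721 interface (approximate):
interface ICryptoPunks {
    function punkIndexToAddress(uint256 punkIndex) external view returns (address);
    function transferPunk(address to, uint256 punkIndex) external;
    function offerPunkForSaleToAddress(uint256 punkIndex, uint256 minSalePriceInWei, address toAddress) external;
    function buyPunk(uint256 punkIndex) external payable;
}

A protocol that only calls the IERC721Minimal functions simply cannot move a punk, which is why wrapping (a wrapper NFT that holds the punk and exposes ERC721) is treated as the user's responsibility rather than a protocol vulnerability.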

Feel free to post feedback below.

Using contract upgradability as a severity mitigation

This contract has a vulnerability in which assets can get locked in the smart contract. This would be a Severity 3.

However, the contract is upgradeable, and the sponsor argues that if the vulnerability wouldn't have been found, they could still upgrade the contract and rescue the funds. On that basis, they suggest the Severity should only be 2.

I'm inclined to mark it as Severity 3. Even if there is a way to make users whole, it will come at a high reputational cost for the protocol, and there is the risk that the rescue doesn't work through a botched upgrade or loss of control.

Loss of Fees as Low

I'd like to flag a Judging Decision in an effort to uphold and improve C4 reports and profesionalism.
code-423n4/2023-04-caviar-findings#217

The finding was judged as Low with the following comments:

Fees will round down to zero due to integer math, I believe this is Low Severity
The fact that this can be done by the owner further reduces impact, but I believe it's still QA - Low

Thank you for the feedback @indijanc I still believe that a rounding error is Low Severity as the loss is marginal, I think quality was good and believe if you sent more this would have been awarded as QA

Also note that inputAmount is normalized in 1e18s meaning that this should not happen unless it's misconfigured

The concerning part here is that it seems the root cause of the finding was used for categorization of the issue. I believe the IT industry is pretty much aligned on how a security issue categorization or severity is determined. There are slight differences here and there, but they're all based on impact and likelihood. Here's what I believe is a good reference and description for issue categorization (rating) - OWASP Risk Rating Methodology

C4 documentation has this nicely condensed in the following guidelines:

QA (Quality Assurance) Includes both Non-critical (code style, clarity, syntax, versioning, off-chain monitoring (events, etc) and Low risk (e.g. assets are not at risk: state handling, function incorrect as to spec, issues with comments). Excludes Gas optimizations, which are submitted and judged separately.
2 — Med: Assets not at direct risk, but the function of the protocol or its availability could be impacted, or leak value with a hypothetical attack path with stated assumptions, but external requirements.
3 — High: Assets can be stolen/lost/compromised directly (or indirectly if there is a valid attack path that does not have hand-wavy hypotheticals).

So based on the fact that the finding describes how a Pool owner can avoid the Protocol Fees, it falls pretty well under the "leak value" part of a Medium categorization. So when the reasoning for a Low categorization was based on the root cause (rounding on division in this case), I thought to myself "Ugh, would then draining of all funds because of integer math be considered as medium? As impact is High but root cause is Low ..."

The last comment for reasoning about the issue categorization indicates that likelihood was indeed considered but again misassessed. The inputAmount (based on virtualNftReserves) is not normalized to 1e18 anywhere in the files of the audit scope, and is only recommended - sponsor acknowledged this with several comments in public chat during the contest. And discarding the likelihood with the famous words "this should not happen" is almost ironic :). Misconfigurations I would argue are one of the first and easiest attack vectors that malicious users and bad actors will abuse.
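For reference, the rounding-to-zero mechanic under discussion, reduced to a minimal hypothetical form (the fee rate, scale, and names here are illustrative, not Caviar's actual parameters):

// SPDX-License-Identifier: MIT
pragma solidity ^0.8.0;

contract FeeMath {
    uint256 public protocolFeeRate = 50; // 0.5%, in basis points

    // Integer division truncates: for any inputAmount below 10_000 / 50 = 200
    // base units, the computed fee is zero. If amounts are not forced to be
    // normalized to 18 decimals (or otherwise kept large), trades can be split
    // into small units so the protocol fee is leaked entirely.
    function protocolFee(uint256 inputAmount) public view returns (uint256) {
        return inputAmount * protocolFeeRate / 10_000;
    }
}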

Disclaimer: I submitted the mentioned finding, so of course I am totally biased. But even putting my unbiased hat on (if that's even possible :) ) I don't think this was properly assessed; it was miscategorized by looking at the root cause instead of understanding the context and assessing impact and likelihood. How would other judges mark this? Do you also consider root cause when deciding on the severity of a finding? @trust1995 @dmvt @0xean @0xleastwood @horsefacts @berndartmueller @Picodes @kirk-baird @Simon-Busch @hansfriese @GalloDaSballo

Please note that I otherwise have great respect for the judge in this contest, as he's an awesome contributor to the community and transparent in all his actions, but I feel there's always room for improvement. Judging these contests with a huge number of findings is, I know, hard work that just drains you, and you simply cannot make correct decisions every time, but I hope we're not sacrificing quality and professionalism to gain speed.

There are little incentives to produce quality descriptions

Issue

The most elaborate descriptions are selected for the report ("front line") and get published, but their authors aren't currently compensated in any way, as duplicates have fixed equal weights. At first glance, why should they be? The selection looks random, the best description gets publicity while the others don't, and the core work is finding the vulnerabilities, not reporting them.

But repeated over many contests, this skews wardens' motivation. Producing quality reports takes time and effort; writing a good POC involves a reasonable amount of coding and can take a whole day. During that time, another issue could have been found. Wardens who do not produce detailed descriptions win over time by submitting extra issues, compared to those who invest time in the write-ups. These efforts are not compensated and become a public good, adding value to the whole project by raising the quality of reports.

Imagine that all wardens concluded that "spamming hints" is the optimal behavior and adhered to it, i.e. there was no one left to free-ride on. The quality of reports would degrade; they would become unreadable for anyone who hasn't investigated the code for several hours. Some valid issues would not be understood by the projects and would be declined, as it's not always clear what is going on without proper details, even for the authors.

Proposal

What can be done? Currently front-line issues are chosen by judges and this works well; the issue is that weights are equal for all duplicates.

The ideal situation could be grading along a curve within each issue, as is currently done for QA reports: the selected-for-report issue automatically receives a 100 rating, while the others are given a 0-100 score by the judge.

The current curve would work fine here, as it's quite rare for an issue to be found by more than 20 wardens. It handles situations with several quality descriptions of the same issue well: for example, descriptions rated 90 and 100 would be compensated nearly the same. It also handles the core problem of having one good description and a number of short or inarticulate ones that still catch the main point: ratings of 10-20 and 100 would be paid quite differently. At the same time, when there is no stark contrast between the findings, it wouldn't take much of the judge's time to put, say, 80 or 90 ratings across the group, nearly replicating the current situation.

What do you think?

Marking contract functions as payable gas optimization

I would like to start a conversation regarding some of these gas optimizations that I see in some of the previous reports, mostly because I am interested in learning more about this space.

The first one I'm interested in is marking contract functions as payable, which appears to save some gas.

I created a repo to demonstrate that marking a function as payable actually does save some gas:
https://github.com/xliumin/payable-gas-optimization/tree/main/src
(run forge test -vvv)

[FAIL. Reason: Revert(Reverted)] testExample() (gas: 57835)
[FAIL. Reason: Revert(Reverted)] testExamplePayable() (gas: 57811)
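The comparison in that repo is roughly of this shape (my sketch, not the repo's exact code): the compiler adds a msg.value == 0 check to every non-payable external function, and skipping that check is where the small gas difference comes from.

// SPDX-License-Identifier: MIT
pragma solidity ^0.8.0;

contract PayableGas {
    uint256 public counter;

    // The compiler inserts a CALLVALUE check and revert path here,
    // costing a few extra gas units per call.
    function bumpNonPayable() external {
        counter += 1;
    }

    // No msg.value check: slightly cheaper, but ETH sent here by mistake is
    // accepted and may be unrecoverable without a withdrawal function.
    function bumpPayable() external payable {
        counter += 1;
    }
}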

The tradeoff of security (potentially locking user funds) makes me believe that we should not accept this as a gas optimization.

Also, I found one gas optimization finding that went in the opposite direction (removing payable from functions that don't use msg.value):
code-423n4/2021-12-yetifinance-findings#15

  1. Are there any situations in which marking a function as payable would not save gas?
  2. Should judges accept these as gas optimizations? (I would say no because of the security tradeoff)
  3. Should it always be considered a low severity finding to remove payable markers from contract functions that don't use them?
  4. Is this the right forum to open this kind of discussion? What is the best way to incorporate these in the rulebook?

Contest Retrospectives

C4 operates competitively, but its members are also operating as a team whose goal is to make a sponsor's project more secure. Wardens, Judges, and CAs all perform roles throughout the contest with these goals in mind. Often there are insights gained, or suggestions about how to make the process more effective, which surface during a contest. In agile software development we have a process, called a retrospective meeting, designed to surface the successes and failures of the team over a certain period of time. I propose we add this concept as the final stage of C4 contests.

How would this work?

Retrospectives are designed to be a blameless self evaluation where self is the team as a whole. Atlassian has designed a playbook for these meetings which is adaptable to the process we have at C4.

The stages of a retrospective meeting are actually quite simple. Team members (judges, wardens, and the contest's CA or CAs) start by writing down individual notes that fall into the category of "what we did well" or "what we can do better", with team members provided 15 minutes for surfacing notes in each category. "We" in this context is the entire team, not individual members. It's generally advisable that each participant come up with at least one idea in each category.

Once these notes have been gathered the team works to group them into themes, similar to our deduping process. These themes are then discussed by the team for 10 minutes. A further 10 minutes is taken to discuss specific actions which can be taken to improve future contests. Traditionally individual team members are assigned to actions, but in our context these will probably inform process changes that are adopted in future contests.

Frequency

I suggest that we start with one or two contests to see how the process translates and am volunteering to do this for contests which I judge. Once we as an organization have had a chance to tweak the process in ways that make sense for C4, I'd recommend that we do this for at least one contest per month. The more controversial the contest, the more effective the process is likely to be.

Participation

I envision participation being open to all judges, all C4 staff, and any wardens who have participated in the contest for which the retrospective is held. Participation in the contest for wardens would be defined as having submitted at least one issue or having commented during post-judging QA. Optionally, we could invite sponsors to join this effort if they want to.

Scope

The scope of each retrospective would be the contest in question and nothing else. Past performance of anyone involved should not be a subject of the retrospective, mostly because any past performance issues likely involve a different team.

Potential issues with the process

Agile development teams typically consist of no more than 8 members (after including the project manager and product owner). C4 contests sometimes involve more than 100 people. For this reason it may prove necessary to limit the participation or run multiple retrospectives for each contest.

In software it is usually fairly easy to track action items. In C4 doing so may be harder given that not everyone involved in the retrospective will participate in every future contest. It is also likely that many of the actions will fall on C4 staff or people who were not part of the specific retrospective (for instance, all wardens or all judges). This will likely require some fine tuning specific to the C4 structure.

Many times a neutral third party is helpful in retrospective meetings to help keep members on track and suggestions blame-free. The nature of C4 may make it hard to find someone who is truly neutral in this way. It's possible that a judge or C4 staff member may be able to take this role, but it may be that we need to add a category of participant to the process who acts solely to run these retrospectives.

Thoughts?

Approve Race Condition - Suggested NC or Invalid

See my older thoughts:
code-423n4/2022-01-yield-findings#6
code-423n4/2022-05-backd-findings#117

The approval race is technically true, and could be considered a threat based on your opinion / bias

If you believe that tokens you've approved belong to you, then you may be inclined to say that approve is vulnerable to front-run.

However, we can counter that by noting that cancelling the allowance, whether via approve(X, 0) or via decreaseAllowance(X, 123), can itself be front-run, meaning there is always the possibility that an allowance is used before it's revoked.

The alternative perspective (the one I have), is that giving your allowance is equivalent to giving your coins away, because the race condition above means in clawing them back you could still lose them.

The reason why some people argue this to be valid is that in the case of increasing an allowance, setting:
approve(X, 1)
Then calling
approve(X, n)

Will put the first allowed tokens at risk as they could be "doubly spent".
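A hypothetical walkthrough of that double-spend, using a deliberately minimal token for illustration:

// SPDX-License-Identifier: MIT
pragma solidity ^0.8.0;

contract MiniToken {
    mapping(address => uint256) public balanceOf;
    mapping(address => mapping(address => uint256)) public allowance;

    constructor() { balanceOf[msg.sender] = 1000; }

    function approve(address spender, uint256 amount) external returns (bool) {
        allowance[msg.sender][spender] = amount;
        return true;
    }

    function transferFrom(address from, address to, uint256 amount) external returns (bool) {
        allowance[from][msg.sender] -= amount;
        balanceOf[from] -= amount;
        balanceOf[to] += amount;
        return true;
    }
}

// Sequence:
// 1. Alice: approve(bob, 1)                                  -- intends to later raise it to n
// 2. Alice broadcasts approve(bob, n)
// 3. Bob front-runs step 2 with transferFrom(alice, bob, 1)  -- spends the old allowance
// 4. Step 2 lands; the allowance is reset to n
// 5. Bob calls transferFrom(alice, bob, n)                   -- spends again
// Total taken: 1 + n, i.e. the first allowance is "doubly spent".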

My conclusion from all of this is to treat any allowance as if the tokens are already spent / lost.

And to set the approval back to 0 when it's no longer needed.

Given the above, I would recommend Non-Critical severity, or outright making it invalid, as race conditions are a property of public mempool chains.

Severity standardization - Contract unupgradeability

It would be beneficial for all of us to standardize severities for common issues, as we have seen variance among judges and contests.

Example of two different severities, for the same issue of lack of __gap[50] array for upgradeable contracts:
Rubicon contest - awarded M
Nouns Builder contest - downgraded to L.

Most of the time I've seen this issue judged as L/QA. IMO, since it is an easy find and will make its way into the QA reports, we can safely dismiss it as an L: no funds are at risk, and there is an option to reinitialize the contract's existing variables, so the current state is not at risk of being unupgradeable.
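For context, the pattern at issue, as a minimal sketch (names are mine):

// SPDX-License-Identifier: MIT
pragma solidity ^0.8.0;

// Base contract inherited by upgradeable children.
contract BaseUpgradeable {
    uint256 internal someConfig;

    // Reserves 50 storage slots so that future versions of this base contract
    // can add variables without shifting the storage layout of every child
    // contract deployed behind a proxy.
    uint256[50] private __gap;
}

contract Child is BaseUpgradeable {
    // Without the gap, adding a variable to BaseUpgradeable later would move
    // childValue's slot and corrupt existing proxy state on upgrade.
    uint256 public childValue;
}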

Opinions are welcome.
@GalloDaSballo

Judging of valid submissions without POC

Valid submissions without a POC should not be nullified.
Considering the short duration of C4 contests, wardens may prefer to spend time on new findings rather than on writing POCs.
Currently, C4 already has a 30% bonus for high quality submissions, which encourages wardens to write POCs.
But if we further slash valid submissions without a POC, then wardens will have to spend time "optimizing" their submissions instead of making more findings.

Standardization of a Backstage role / viewing findings after contest

Proposal Title

Proposal to allow all wardens to be somehow able to fight for their issue after the contest.

Description

It is not a secret that having the backstage role is a blessing for anyone who has it, but that leaves most of the wardens who do not have it at a complete disadvantage. Talking to several top wardens from C4, all of them told me that they massively benefited from being able to discuss their issues after the contest.

Several examples are:

The issue was not 100% clear and you can explain it better.
The judge does not understand the issue to its fullest.
The sponsor simply does not believe that the issue is valid and/or wants to downgrade or invalidate it.
Not noticing that it is a duplicate of x.

Talking to backstage wardens, the number of issues that got accepted and rewarded increased after they got their backstage roles.

Today, it is even worse than before: you can't get the backstage role anymore because it has been paused for a while now, so even skilled auditors who recently started can't get there.

This is a net negative for C4, because these wardens feel like they get less than they deserve, since they can't dispute their issues the way other top wardens can.

Proposal

There has to be a way where all the wardens, or at least wardens that scored higher than x, but preferably all, are able to see their issue after the contest.

I do believe that the best model at the moment is to adopt the Sherlock model, where an escalation period is given to submit your response and an escalation proposal.
I think the only inconvenience here is time, because C4 takes more time due to having more users and accepting QA and gas reports.

Therefore, I propose a model where escalations cost at least somewhere around $30-50; that way, the volume of escalations should not differ much from the current one, because only wardens who are completely sure their issue is valid would escalate it.

I think it is clear that something regarding seeing the issue after the contest has to change, because there is a massive disadvantage between wardens, but we have to find a way to achieve it. So even if my current proposal is not accepted by most of the wardens / the C4 team, I think it would be best to find other alternatives to solve the issue.

Also, if I am not wrong, the C4 team requires an NDA from wardens to access backstage; if that is required, I would be perfectly happy to continue that model together with this proposal, so that wardens who have signed the NDA can access backstage.

Bot race submission automation

The initial plan for bots was to make submissions fully automated, and I personally believe this should still be the goal.

Below are the main advantages and possible issues of implementing fully automated bot submissions:

Advantages

  1. Instant Access to Reports: Automating the process would allow judges to access the reports almost instantly, reducing the wait time compared to the current approach, which could take 1-2 hours depending on how many concurrent races we have.

  2. No Pressure for Bot Racers: With automatic reports, bot racers would no longer need to plan for the race(s). Standardizing the participation process would be beneficial, especially for participants in different time zones.

  3. Simplified Judging: Automating the report generation would make judging easier. Currently, a significant amount of time during the race is spent manually removing false positives from the report. By eliminating this step, judges can focus on evaluating the best reports, which are the ones that are cleaned automatically. This approach would help judges identify and penalize bots with a high number of false positives more efficiently.

  4. Higher Scalability: By removing human intervention, top bots can participate in all races without any effort, ensuring a higher level of scalability and quality.

Issues

  1. Risk of False Positives: Without manual filtering, there might be an increase in false positives, which could initially slow down the judging process. One solution to this issue is to implement a heavy slashing of false positives, allowing higher quality reports to be prioritized for review, while reports containing a lot of spam can be skipped.

  2. Potential Technical Issues: If something breaks, bots might not be able to participate in a race. To address this concern, we can conduct a few test runs concurrently with the current approach. If no participants encounter problems, we can proceed with full automation in the future.

  3. Delivery of the Report: Bots are now highly complex and valuable, making bot crews hesitant to share their source code. A possible solution is to put the bot behind a server controlled by the crew, and provide access to C4 through an API.

  4. Scope is not standard: Currently, all contests have different formats in the README, making it difficult to parse the correct scope. As of now, this is a manual operation performed by the bot crews. We need to establish a standardized scope written by the scout, which should remain consistent for every race.

Proposal to disallow overly confidential contests

The recent Chainlink CCIP contest has introduced a completely new level of contest confidentiality. The only parties who will ever have access to the findings repo are the specific contest's lookout, the judge, and the sponsor. I believe this format sets a disastrous precedent, which not only goes against the otherwise open nature of C4 but is also functionally insufficient.

I have not yet judged a contest without several errors being pointed out during the QA period, and the same can probably be said of all judges. When there are 700-1000 findings per contest, no matter how great your intentions and efforts are, there will be mistakes: wrong duplicates, actual invalids, and so on.

The QA period is a mandatory layer of defense! Backstage wardens should have the right to defend themselves if their submissions are interpreted wrongly by the sponsor or judge. They also have the right to scrutinize issues that seem legitimate on the surface but that concrete evidence shows are not.

As an audit firm manager, I wholly understand the business need of catering to clients and customizing specific packages. But there are some lines that, once crossed, cost you too much of what made you great to begin with. It is at this moment that an organization's courage is tested.

Say no, stay open-source. We're playing the long game. The rest is noise.

Fault in out-of-scope library, impact in in-scope contract, to reward or not to reward

Hi,

Contract A is in scope and uses library B, which is not in scope.
Library B has a bug, and contract A can be proved to misbehave (i.e. provable HM-compatible impact) because of this bug.

Are such issues considered for rewarding?

I did not find any source of truth on this situation, only scattered comments on submissions, and I just realized that my assumptions on the subject (which come from Immunefi, where such issues are rewarded) may simply not hold here.

TY!

Proposal - Appeal committee

I believe that in order to further democratize the C4 experience, it is necessary to add an additional stage to each contest: an appeal committee. Wardens who are convinced their submission has been judged incorrectly should be able to file an appeal after post-judging QA.

An appeal committee will be comprised of 3 judges, who did not judge the contest in question. By majority vote, they will come to a decision regarding a change in severity or re-opening of closed submission.

Appeals cannot be filed recklessly. They will require a deposit of $200-400, which will only be returned to the warden if the appeal is accepted. In this format, I don't expect appeals to be sent except in situations where they are justified, respecting the time value of judges.

Appeals will give further confidence and motivation to wardens, as they will have assurance that their submissions are judged correctly and with a well-balanced, nuanced approach. It is an inevitable step in the standardization trend currently in motion.

Clearly, the appeal stage will delay the completion of contests by several days. Contest pipelines are currently ~2 months long. I think it is a reasonable price to pay for the preservation of democracy in C4.

A case for a mandatory medium and high severity security markup language

Justification

Currently, vulnerability reports contain one of three types of POC:

  1. Full scale unit tests that run and pass (or fail to demonstrate the issue)
  2. A list of instructions such as Alice sends 2 Eth to Contract. Bob then calls withdrawal
  3. Verbal handwaving

While option 1 is ideal, requiring wardens to write unit tests can place quite a large burden on them. On the other end of the spectrum, low-quality verbal handwaving burdens sponsors and judges with having to validate the implied logic. Quite often the verbal handwaving conceals logical errors, since there is no "static typing". It also penalizes speakers of English as a second language, who may make grammatical errors or use idioms in ways not usually found in English.

Pseudo code isn't helpful because vulnerabilities are usually reliant on blockchain specific nuances such as whether the attack is a flash attack (same block) or requires great periods of time to pass (timestamp overflow). What would be helpful is a simple markup that is pertinent to security in the context of an EVM type blockchain and that is flexible enough to allow for custom solidity (or vyper) when the need arises.

Introducing the Blockchain Audit Markup Language (BAML)

An attack vector consists of the following components:

  1. The initial state of the contracts. This is the deployment/initialisation/prior usage scenario that sets the scene
  2. The span of the attack. One block? Many blocks? Time period?
  3. The POC
  4. Caller context. That is, who is calling what.

Every attack vector can be specified using the above four points. In addition, some contract calls can get quite complex because of byte packing and hashing, so there should be an allowance for custom code when necessary, to avoid requiring the warden to become a BAML master simply to submit an issue. The following example should serve as a universal template. Comments use double forward slashes as usual. We're auditing a contract called Dog, and we're showing that a flash loan on UniV3 can break subsequent calls to Dog.swap():

    Assumptions:
    Let Uni be UniswapV3 factory
    Let Pair = Uni.getPair(ETH, USDC)

    INIT:
    // ...deployment calls
    deploy Dog with Owner = X
    Dog.init(30e18, 20e18, 20e18) with caller = X

    POC:
    Block begin, num = 10
    Dog.swap(10e10, 6e15) with any caller
    Let complexParam =

    // some complex solidity

    Pair.flash(X, 1000e18, 0, complexParam)
    Block end

    Block begin, num = 11
    expect Dog.swap(10e10, 6e12) to revert with
    Block end

Epilogue

Improvements to the markup language are welcome. It should evolve over time as C4 users discover better ways of doing this.

A special BAML reviewer could act as a human compiler to narrow down the list of valid issues the sponsor and judge have to review.
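To illustrate how such a reviewer (or eventually a tool) could pre-process submissions, here is a minimal sketch that splits a BAML submission into the sections of the template above. The parsing rules are my own assumption, not part of the proposal.

    # Minimal sketch: split a BAML submission into its sections.
    # The section markers follow the template above; the parsing rules are assumptions.
    def split_baml(text: str) -> dict[str, list[str]]:
        sections: dict[str, list[str]] = {"Assumptions": [], "INIT": [], "POC": []}
        current = "Assumptions"  # the template opens with assumptions ("Let ...")
        for line in text.splitlines():
            stripped = line.strip()
            if stripped.rstrip(":") in sections:
                current = stripped.rstrip(":")
                continue
            if stripped and not stripped.startswith("//"):
                sections[current].append(stripped)
        return sections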

RFC - A case for or against Admin Privilege Reports

This discussion is aimed at exploring which Admin Privilege findings we should continue to mark as valid, vs consider "out of scope", "known risks", etc...

The discussion requires taking both sides, as there seems to be no ideal, simple answer: a malicious sponsor must be called out for their malpractice, while an honest sponsor doesn't like being told what they already know.

Context:

By "end users" here I mean "retail": non-technical people who rely on an expert for their security.

While I would recommend that any person even remotely involved with Ethereum's tech learn the basics, we cannot assume that, and we know that the majority of users are "blind signers".

That said, we have a historical track record of our Reports being used to "prove" the safety of a Sponsor's system.

Ultimately this is natural, as C4 is 50% crowdsourced security and 50% proper formal audit (thanks to some legends in the community).

This means that our Reports are used as a marketing tool. In the right hands they are simply proof of thoroughness (Project X did Audit1, Audit2, and even did C4 to iron itself out).

In the wrong hands, they are a tool to get a quick list of issues, dispute them all and put the logo of C4 on their website.

Historical Precedents

Since day 1, we've had findings that over time we ended up calling "Admin Privilege".

Admin Privilege findings are specific attacks that a Privileged Role can apply to the system in scope.

Because of their nature, they are controversial (if not outright insulting to the Sponsor); however, most sponsors (including myself) recognize them as valid, in the sense that they are technically risks the end user has to accept when using the system.

Also, while extremely easy to downplay, our space has seen these "basic mistakes" happen countless times:

  • Project leaks their PK
  • Project gets rugged
  • Project is hacked

And ultimately, because of the anon-friendly nature of the space, it's impossible to discern between an incompetent project (leaks the key because of bad opsec) and a malicious project (steals the funds and claims the keys were compromised).

There is ultimately no way of knowing.

The solution to the above is not a "more trust", or "more due diligence" type deal, as again we are not in a position to know if the Admin did it on purpose.

The only solution to the above is to always flag those reports as valid risks that end-users face if interacting with the contract.

This has been consistently practiced at Code4rena for a long time and, perhaps due to market demands, has not been put into question.

The discussion's goal is to determine whether it's correct to flag Admin Privilege type findings, where the limit of absurdity lies, and how these types of reports can be gamed.

While there has been oscillation in judging, I think a very strong argument can be made for Admin Privilege type findings to always fall under Medium severity.

See gist for 2021:
https://gist.github.com/GalloDaSballo/881e7a45ac14481519fb88f34fdb8837

First Principles Questions, First Principle Conclusions

If we agree on the above, then the next logical question is what C4 is, or what we want it to be.

Do we want C4 to be a technical only type firm? Or do we want to keep the user facing nature of the Project?

If we keep it public, knowing the above, we have to agree that Admin Privilege has to be flagged.

Not flagging Admin Privilege allows malicious projects to create fake projects and just rug via Admin Privilege, while using C4 as a quick "We were audited bro" scapegoat tool.

Risk of Abuse, where do we draw the line?

A few specific instances are linked below, with the goal of sparking a discussion that can help us figure out how to best deal with these types of reports in the future:

[REDACTED CURRENT CONTEST FINDING]

Zora Nouns Builder, Vetoer, lack of Vetoer and Bribing

These three reports show the circular logic that we may run into whenever we deal with governance systems as well as Admin Privilege.

I have judged these, and you should read the comments in depth to understand my point of view.

Anyhow, I don't mind disputing these as lower severity (e.g. Low), which I think can be argued for or against.

What I'd like to show is that if you pick any of the three as Med or Low, then you are logically bound to rate the others with the same severity.

That's because these three findings are, to me, different perspectives on the same issue.

Do we need to call this another trilemma? I think we can do without.

But I think you'll be hard-pressed to argue that a vetoer is a purely good thing; it's not, it's a compromise. The same goes for not having a vetoer, as well as for any other governance attack that has been exploited in the past.

Request for Feedback

Do you agree with me that those 3 findings are logically equivalent?
(Vetoer is trusted party, opens up risk, lack of vetoer is risk, bribing is a specific attack when lacking a vetoer)

Do you agree with me that because of the historical context, and the rules we have, given the logical equivalency, those 3 findings are all Med Severity?

Given the two above, do you agree with me that this situation shows "a glitch in the matrix": due to the logical equivalence, circular logic ends up being accepted as valid, since rejecting only some of these findings would create favourites?

What do you think we should do?

  1. Acknowledge and continue consistently with the convention

I think that explicitly acknowledging the contradiction, out of an interest in transparency to end users, is probably the best first step. This means we will keep flagging these reports, fully knowing that sponsors will disagree in the majority of cases.

  2. Acknowledge and create a new category, Admin Privilege

This is equivalent to the above and creates more work for the judges, but it should reduce friction with the sponsor. Sponsor feedback should be sought to better understand this option.

  3. Find the line of absurdity

Let's further discuss where the line between Admin Privilege and logical absurdity lies, through nuanced, evidence-based discussion.

Severity standardization - Whitepaper / Specification Bugs

Description

There are often bugs in a whitepaper or specification document that are not reflected in the code.

Issues in the whitepaper are considered valid in C4. However, I don't believe we have a standard for grading these issues. I would like to propose a system for translating whitepaper bugs into C4 severity.

New standard

  • Whitepaper: high to critical -> Medium severity
  • Whitepaper: QA to medium -> QA severity

Motivation

The motivation behind this reasoning is that whitepaper issues do not allow direct loss of user funds if these issues are not present in the code. However, they can seriously impact third parties who are relying on specific logic.

Furthermore, newly hired developers will read the whitepaper to get up to speed and maybe introduce bugs in future development, thus this raises maintainability concerns for the project.

Proposal - Guidelines for demoting judges

We have clear guidelines on how wardens can become judges: https://docs.code4rena.com/roles/judges
However, there is no such guideline for the reverse process.

There are many reasons we might want to demote a judge, so it is important to have clear guidelines on this.

  1. A passionate judge might become apathetic in time,
  2. A judge could have been misjudged to have sufficient knowledge and skills when they were hired/promoted,
  3. There might be wardens better suited to be judges but all the seats are taken by old & rusty judges,
  4. A judge might be/become unfair/malicious,
  5. A judge might be/become lazy,
  6. A judge might no longer be active in C4.

So what can be some clear reasons for a demotion? Below are just my suggestions.

  1. Judge is generally too slow,
  2. Judge is generally not responsive in post-contest triage and post-judging QA,
  3. Judge makes too many mistakes (e.g. marking valid as invalid and vice versa),
  4. Judge is found to favour specific wardens (or found to be unfair in other ways),
  5. Judge "willingly makes a mistake" when the mistake was clearly demonstrated in post-contest triage or post-judging QA.

For the 5th one, we must quickly make the "Appeal committee" a reality. The appeal committee would also decide whether the judge was unreasonable in their decision. This way we can record repeat offenses and have a quantitative guideline for the demotion process.

Also related: #71

Severity for protocol does not work as intended

I've seen a few vulnerabilities related to the protocol not working as intended marked as Medium severity. I would like to open a debate about how these kinds of vulnerabilities should be handled.

Example:
A token that is expected to have a limited supply, where minting cannot exceed a specified value. There is a vulnerability that allows unlimited minting by the admin.

Personally, I think these kinds of vulnerabilities should be treated as high severity. The protocol is not working as intended, and this causes serious harm to users who make financial decisions based on false information. I have seen the argument that exploiting this vulnerability would require a malicious admin, but this is incorrect: simply announcing that minting is not limited would cause sell pressure and losses for users.

Proposal - Additional grade for bot contests

Currently, the three grades are:
A - High-value report
B - Satisfactory report (eligible for rewards)
C - Unsatisfactory report (not eligible for rewards, warning/disqualification)

Two Cs disqualify bot crews for future races. I believe the grading system lacks a middle tier between B and C, and propose the following structure:
A - High-value report
B - Satisfactory report (eligible for rewards)
C - Unsatisfactory report (not eligible for rewards, no negative impact)
D - Particularly low quality report (not eligible for rewards, warning/disqualification)

The proposed tier system is less punishing for newer bots who are being improved but don't have too much to show as of yet. It will also make judging more fine-grained. Judges will not feel the need to upgrade a strong C to a B to not disqualify a crew, which has the side effect of diluting the pot.

Happy to hear thoughts.

Bot race judging standardization proposal

TL;DR Update - 30/Oct/2023

Given the length of this thread, and how the process has evolved since the original thread was started, these are the current expectations for judges:

Final Rank

  • >= 80% of the winner's score: Rank A
  • >= 60% of the winner's score: Rank B
  • < 60% of the winner's score: Rank C

Score

For each valid instance:

  • H +20
  • M +10
  • L +5
  • NC +2
  • GAS-L +5
  • GAS +1

Penalties

Subjective issues that are considered invalid should not be penalized; they can simply be ignored. When in doubt, an issue should be ignored rather than penalized.

For each invalid instance:

  • H -20
  • M -10
  • L -5
  • NC -2
  • GAS-L -5
  • GAS -1
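As a minimal sketch of the above (my own illustration; deciding which instances are valid, penalized-invalid, or merely subjective remains the judge's call), the score and rank could be computed like this, assuming the rank percentages are measured against the top-scoring report:

    # Sketch of the TL;DR scoring, assuming ranks are relative to the top score.
    POINTS = {"H": 20, "M": 10, "L": 5, "NC": 2, "GAS-L": 5, "GAS": 1}

    def report_score(valid: dict[str, int], invalid: dict[str, int]) -> int:
        """valid/invalid map each category to its number of (penalized) instances."""
        gained = sum(POINTS[cat] * qty for cat, qty in valid.items())
        lost = sum(POINTS[cat] * qty for cat, qty in invalid.items())
        return gained - lost

    def final_rank(score: int, top_score: int) -> str:
        ratio = score / top_score if top_score else 0
        if ratio >= 0.8:
            return "A"
        if ratio >= 0.6:
            return "B"
        return "C"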

Disputed Issues

This section should not be awarded or penalized, but it can be used while judging. It is recommended to double-check a disputed finding before penalizing the finding it refers to, as the dispute itself may be invalid.

Post-Judging

It's expected that the judge will share a judging sheet with detailed information about the winner's score (e.g. this report).
The judge should also send each bot crew their respective detailed judging sheet when asked via DM.


Original Thread

It is really important to have a coherent standard between bot races, and this proposal is an attempt to define some guidelines for the judges.

Reports are getting very long, so judging becomes very hard when the main repository contains many lines of code: sometimes it's simply not possible to review every issue from every bot, as that would mean millions (!) of lines to be read by the judge in less than 24h.

I propose the following phases that should be handled by the judge in every bot race.

Phases:

  1. Risk alignment (first pass)
  2. Issue sampling (second pass)
  3. Report structure analysis
  4. Scoring

1. Risk alignment

The main goal of this phase is grouping/clustering the risk of each issue by looking at its title: this is to ensure that the next phase is fair.

  • If a bot crew submits an issue with a wrong risk rating, it should be upgraded/downgraded accordingly. If this happens multiple times on the same issue, the judge MAY issue a warning. If this happens again after the warning, the issue is treated as invalid.

  • If two bot crews submit the same issue with a different risk rating, the judge should apply the same risk rating to every duplicated issue.

GAS issues should be marked as [GAS-L, GAS-R, GAS-NC] by the bot crews with the following criteria, based on the optimization:

  • GAS-L: storage or >1k gas
  • GAS-R: calldata/memory or 100-1k gas
  • GAS-NC: everything else
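For illustration only, the GAS sub-labels could be derived mechanically from the criteria above; the threshold values come from the list, while the function itself is just a sketch:

    # Sketch: label a gas finding per the criteria above.
    def gas_label(touches_storage: bool,
                  touches_calldata_or_memory: bool,
                  estimated_gas_saved: int) -> str:
        """Classify a gas optimization into GAS-L / GAS-R / GAS-NC."""
        if touches_storage or estimated_gas_saved > 1_000:
            return "GAS-L"
        if touches_calldata_or_memory or estimated_gas_saved >= 100:
            return "GAS-R"
        return "GAS-NC"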

2. Issue sampling

We should decide on a specific % of issues that should be sampled in each category, with a minimum of m and a maximum of M issues to be reviewed.

These numbers (m and M) should be constant for every race, and they should be chosen so that it is possible to judge any contest in less than 24h.

The judge will review at most M * categories issues and at least min(largest_report_issues, m * categories) issues per report, where categories = |[H, M, L, R, NC]| = 5 and largest_report_issues is the total number of issues in the largest report.
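A small sketch of these sampling bounds; the values of m and M are placeholders to be agreed upon:

    # Sketch of the per-report sampling bounds; m and M are placeholders.
    CATEGORIES = ["H", "M", "L", "R", "NC"]

    def issues_to_sample(largest_report_issues: int, m: int, M: int) -> tuple[int, int]:
        """Return (minimum, maximum) number of issues the judge reviews per report."""
        maximum = M * len(CATEGORIES)
        minimum = min(largest_report_issues, m * len(CATEGORIES))
        return minimum, maximum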

For each sampled issue, the judge should leave a comment choosing from:

  • A (valid)
  • B (valid, but with false positives in the 1-20% range)
  • C (invalid, plain wrong, or false positives > 20%)

The judge should pay attention to the number of instances if the issue has duplicates. If this number is less than 50% of the instance count of the best duplicate, the issue should be penalized by reducing its rank once (e.g. A -> B, or B -> C).

This is to discourage bots from submitting just a single instance of an issue in order to avoid false positive penalties.
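A sketch of this duplicate-instance penalty, using the A/B/C labels assigned to sampled issues:

    # Sketch: downgrade a sampled issue's rank once if it lists too few instances
    # compared to the best duplicate (< 50% of its instance count).
    def apply_instance_penalty(rank: str, instances: int, best_duplicate_instances: int) -> str:
        downgrade = {"A": "B", "B": "C", "C": "C"}
        if instances < 0.5 * best_duplicate_instances:
            return downgrade[rank]
        return rank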

3. Report structure analysis

This step ensures that the report is well-written and readable. A report with bad formatting is not very usable, so reports should also focus on structure quality.

This is a subjective judgment, but it is split into categories to ensure fairness.

The judge should review the following, giving a score between A and C:

Structure

Good examples: follows the table of content best practices, the report flow is easy to follow

Bad examples: issues are put in random risk order, lots of links/references, no formatting

Content Quality

Good examples: Good grammar, good descriptions, conciseness

Bad examples: Wall of text, fails to correctly explain the issue's impact, zero references for non-obvious issues

Specialization

Good examples: able to understand the repository context, able to describe an impact for the specific protocol

Bad examples: generic descriptions of issues, generic issues that are technically valid but impossible to occur due to how the project works

4. Scoring

The main goal of the bot race is to generate a report that takes common issues out of scope: quantity and quality should be the main criteria, while risk rating should be secondary.

The main reason is to avoid having a winning report that finds multiple high-severity issues but almost zero low/NC ones.

For these reasons, risk rating should be flat instead of percentage based.

Main scoring formula:

	score = 0
	for each risk_category:
		score += issue_rating_avg * risk_category_qty * risk_category_multiplier
	score += report_structure_score

issue_rating_avg formula:

	score = 0
	for each sampled_issue:
		if judging A:
			local_score = 1
		if judging B:
			local_score = 0.5
		if judging C:
			local_score = 0
		score += local_score
	score /= sampled_issue_qty

risk_category_multiplier formula:

	if risk H:
		score = 9
	if risk M:
		score = 7
	if risk L:
		score = 5
	if risk R:
		score = 3
	if risk NC:
		score = 1

report_structure_score formula:

	score = 0
	for each report_criteria:
		if judging A:
			local_score = 7
		if judging B:
			local_score = 4
		if judging C:
			local_score = 0
		score += local_score

The bot score is then applied to a curve, to give the final score that will be used to calculate the rewards.

Winner: highest score
A: at least 80% of the winner's score
B: at least 60% of the winner's score
C: at least 40% of the winner's score
D: less than 40% of the winner's score

Grades C and D get zero rewards.

A bot is disqualified (can't compete in races until it is qualified again) if any of these conditions occur:

  • Single D grade
  • Does not participate in 2 races in a row
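Putting the formulas and the curve together, here is a minimal Python sketch of the proposed scoring and grading. The per-issue ratings and structure grades remain the judge's input; the variable names are mine.

    # Sketch of the proposed scoring and grading, following the formulas above.
    RISK_MULTIPLIER = {"H": 9, "M": 7, "L": 5, "R": 3, "NC": 1}
    ISSUE_RATING = {"A": 1.0, "B": 0.5, "C": 0.0}
    STRUCTURE_RATING = {"A": 7, "B": 4, "C": 0}

    def bot_score(sampled_ratings: dict[str, list[str]],
                  issue_counts: dict[str, int],
                  structure_grades: list[str]) -> float:
        """sampled_ratings: per risk category, the A/B/C ratings of sampled issues.
        issue_counts: per risk category, the total number of submitted issues.
        structure_grades: A/B/C for each report structure criterion."""
        score = 0.0
        for category, multiplier in RISK_MULTIPLIER.items():
            ratings = sampled_ratings.get(category, [])
            if ratings:
                avg = sum(ISSUE_RATING[r] for r in ratings) / len(ratings)
                score += avg * issue_counts.get(category, 0) * multiplier
        score += sum(STRUCTURE_RATING[g] for g in structure_grades)
        return score

    def grade(score: float, winner_score: float) -> str:
        ratio = score / winner_score if winner_score else 0
        if ratio >= 0.8:
            return "A"
        if ratio >= 0.6:
            return "B"
        if ratio >= 0.4:
            return "C"
        return "D"  # C and D get zero rewards; a single D disqualifies the bot

Note that, per the formula, a risk category with no sampled issues contributes nothing to the score.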

Standardize acceptance of reports based on automated findings

It is (still) unclear to most wardens and some judges which automatic findings can be reported to contests.
Quoting this discussion:

Wardens may choose to use c4udit and other automated tools as a first pass, and are welcome to build on these findings by identifying high and medium severity issues. However, submissions based on these will have a higher burden of proof for demonstrating to sponsors a relevant hm exploit path in order to be considered satisfactory.

My understanding is that you cannot report these automated findings without reasoning about how they are (in your opinion) a Medium or a High, and proving it. Reporting "there is no check that the treasury address is not zero" is not the same as reasoning about how this could lead to NFTs being stuck in the contract, for instance.

I recently had this report marked as unsatisfactory, and I think that if this one is unsatisfactory, then all findings built on automated findings should be rejected, because ultimately they are mitigated the same way (i.e. requiring an address to be non-zero, using safeTransfer, etc.) regardless of how elaborate the attack vector is.

But the ultimate goal here is to provide value to the sponsor. If all automated findings are rejected, the sponsor might not mitigate them because they are not aware of the impact they could have, and this is bad for both the sponsor and Code4rena.

I'd like to see this point standardized, so it is clear to all parties (wardens, judges, and sponsors) when a report based on an automated finding is valid and when it is unsatisfactory.
