mwhicks1 / analyze-conversion

Extract information about conversions to and from Checked C.
These programs were ported four years ago, when Checked C was new and weaker than it is now. Some artifacts in these files are no longer needed, and they make the diffs worse.

In addition, we might want to make changes to both `manual` and `orig` so that formatting diffs due to `3c` are minimized. For example, when `3c` rewrites a multi-variable declaration, it moves each variable to its own line. To avoid making this look like something useful in the `3c` diffs, we probably want to refactor `manual` and `orig` to use multi-line declarations to start with. There are probably other cases with line breaks, etc.

The changes to make, to both `orig` and `manual`, are discussed in this document.
Ptrdist
Olden
The benchmarks herein use the directory `orig` for the original C code, and `manual` for the Checked C-ported versions.
The ported versions have artifacts in them just to make `C3` happy. For example, line 486 of `manual/anagram.c` is

```c
printf(StringFormat, apwCand[u]->pchWord, (u % 4 == 3) ? '\n' : /**/ ' ');
```

The `/**/` was added just to keep `C3` from messing up the unconversion process. These same artifacts should be added to the `orig` versions, so that when diffing the two, we don't count that change as a legitimate difference.

Probably the best way to do this is to script the change, so that the edits are made to both `orig` and `manual` by the script; then future improvements to `C3` that make the artifacts unnecessary will only require changing the script, not both sets of originals. Ideally this script uses `sed`, so that other changes to the files don't throw it off.
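As a sketch of what that script could look like (the `sed` rule below is illustrative, one substitution per artifact, and the `.fixed` output naming is hypothetical):

```shell
#!/bin/sh
# Sketch of a fixup script: apply the same artifact edits to both orig and
# manual, so the artifacts never show up as differences between the two.
fixup() {
  # Insert the /**/ marker before the trailing ' ' argument of the printf.
  sed "s|: ' ');|: /**/ ' ');|"
}

# Run the fixups over both directories (writing *.fixed to keep sed portable).
for d in orig manual; do
  [ -f "$d/anagram.c" ] && fixup < "$d/anagram.c" > "$d/anagram.c.fixed"
done

# Demonstration on a single line:
printf '%s\n' "printf(StringFormat, apwCand[u]->pchWord, c ? nl : ' ');" | fixup
```

This keeps the list of artifacts in exactly one place, so retiring an artifact means deleting one `sed` rule.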
In sum:

- a `sed` script to make fixups to `orig` and `manual`, for each program
- `manual` changes so that the script reproduces them

Macro expansion conversion failures:
Ptrdist (manual and reverted):
https://github.com/correctcomputation/actions/runs/2500631483?check_suite_focus=true
https://github.com/correctcomputation/actions/runs/2500630770?check_suite_focus=true
Olden:
(Original): https://github.com/correctcomputation/actions/runs/2500628967?check_suite_focus=true
(Reverted): https://github.com/correctcomputation/actions/runs/2500629721?check_suite_focus=true
(Manually): https://github.com/correctcomputation/actions/runs/2500629978?check_suite_focus=true
They fail in `convert_project.py`, during macro expansion.
Rather than just doing `diff -w` when computing diffs between `3c-orig` and `manual`, and between `3c-revert` and `manual`, we should do so only after filtering out occurrences of:

- the `_Unchecked` keyword
- `#pragma CHECKED_SCOPE ...` (to the end of the line)

Make a script in the `analyse-conv/src` directory to do these things, and call it from `create-stats.sh` when making the diffs.
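A sketch of such a filter (the function name is hypothetical) might be:

```shell
#!/bin/sh
# Hypothetical pre-diff filter: strip the _Unchecked keyword, and strip
# CHECKED_SCOPE pragmas through to the end of the line, before diffing.
filter_checked() {
  sed -e 's/_Unchecked//g' \
      -e 's/#pragma CHECKED_SCOPE.*//'
}

# Usage sketch: filter both inputs to temp files, then diff -w the results.
printf '%s\n' '#pragma CHECKED_SCOPE on' '_Unchecked void f(void);' | filter_checked
```

Because `diff -w` ignores whitespace, leaving blank residue on the filtered lines (rather than deleting them) keeps the two sides' line numbers aligned.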
We want to tabulate the changes / work at the various stages. I propose we compute the following:

1. Original (`orig`) --> manual port, reverted (`revert`): count the lines changed (added/removed)
2. `revert` --> manual port (`manual`): checked pointer statistics & lines changed
3. `revert` --> `3c-revert`: checked pointer statistics & lines changed
4. `orig` --> `3c-orig`: checked pointer statistics & lines changed

By "checked pointer statistics" I mean the result of `3c -dump-stats`, but enhanced to include casts. Here's the table we can generate, with the entries for `yacr2` filled in.
```
Program  | Ver    | Lines refc. | Lines annot. | #ptr | #ntarr | #arr | #wild | #bounds | #casts
---------+--------+-------------+--------------+------+--------+------+-------+---------+-------
yacr2    | manual | 168 (6.5%)  | 290 (11%)    |  15  |   5    | 135  |   1   |    ?    |   ?
         | revert | n/a         | 253 (9.6%)   |  54  |   4    |  93  |   5   |    ?    |   ?
         | orig   | n/a         | 245 (9.4%)   |  53  |   0    |  93  |  10   |    ?    |   ?
anagram  | manual | ...
```
The first row comes from bullets 1 and 2; the second and third come from bullets 3 and 4, respectively.
To-do list

- `src/create-stats.sh` to compute the refactoring statistics, per program
- `src/combine-stats.sh` to combine statistics within Olden, Ptrdist
- `3c -dump-stats` to properly compute bounds information
- `3c -dump-stats` to count added casts
- `3c` to compute pointer count information. Probably want a new script in `src` to do this; the aggregate JSON results are saved (other files generated are discarded).
- `3c -dump-stats` to compute missing information (this issue).
- the `.tex` table, from the gathered data files

Details on these tasks below.
For the lines-refactored count, we can do this (from the `analyse-conv/src/Ptrdist/yacr2` directory):

```
> diff -w orig revert | diffstat -s
 12 files changed, 168 insertions(+), 139 deletions(-)
```

where the total lines are

```
> sloccount orig
... (lists the 14 files) ...
Totals grouped by language (dominant language first):
ansic:         2597 (100.00%)
```

Assuming every inserted line matches a deleted one, we take MAX(#insert, #delete) as "lines changed," giving us 168, or 6.5% of the total lines (168/2597 * 100).
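This computation can be sketched as a small helper (the function name is hypothetical):

```shell
#!/bin/sh
# Hypothetical helper: report MAX(#insert, #delete) as "lines changed",
# along with its percentage of a sloccount line total.
lines_changed() {  # usage: lines_changed INSERTIONS DELETIONS TOTAL_LINES
  awk -v i="$1" -v d="$2" -v t="$3" \
    'BEGIN { c = (i > d) ? i : d; printf "%d (%.1f%%)\n", c, 100 * c / t }'
}

lines_changed 168 139 2597   # the yacr2 numbers above: prints "168 (6.5%)"
```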
For the lines-annotated counts, we do

- `diff -w revert manual | diffstat -s`, relative to `sloccount revert`
- `diff -w revert 3c-revert | diffstat -s`, relative to `sloccount revert`
- `diff -w orig 3c-orig | diffstat -s`, relative to `sloccount orig`
Currently, the line-diff computation is in https://github.com/correctcomputation/analyse-conv/blob/main/src/create-stats.sh#L25; it computes more detailed, per-file statistics, but I don't think we need those. We will need to change the format of this file and add to it the line counts from `sloccount`.
We compute pointer statistics on three versions. We run `3c` as already described herein (same flags, targets) for `orig` and `3c-orig`, but with `-dump-stats`. We also run it on `manual` (with the same flags as the other targets). For example, for `manual` the command line could be

```
> 3c -alltypes -base-dir=manual -output-dir=/tmp -extra-arg-before=-I.. -extra-arg-before=-DTODD -dump-stats manual/assign.c manual/hcg.c manual/maze.c manual/vcg.c manual/channel.c manual/main.c manual/option.c
...
```
```
Summary
TotalConstraints|TotalPtrs|TotalNTArr|TotalArr|TotalWild
156|15|5|135|1
NumPointersNeedBounds:59,
NumNTNoBounds:0,
Details:
Invalid:58
,BoundsFound:
Array Bounds Inference Stats:
NamePrefixMatch:0
AllocatorMatch:0
VariableNameMatch:0
NeighbourParamMatch:0
DataflowMatch:0
Declared:1
```

The above output is stored in JSON format in the file `TotalConstraintStats.json.aggregate.json`. Some other JSON files are generated, which we can discard.
We'll want to extend `-dump-stats` to include added Checked C casts --- `_Assume_bounds_cast` and `_Dynamic_bounds_cast` (are there others?). It also seems like the computed bounds information is off; beyond just `count(c)` and `byte_count(b)`, we want to also count `bounds(p,q)`. I think we can ignore added `_Unchecked` annotations, since this determination seems more subjective, and it's hard to match per-function annotations (what `-addcr` does) to the whole-function `#pragma`. See correctcomputation/checkedc-clang#565.
For various programs herein, the `3c` calls fail:

```
/Users/mwh/checkedc/checkedc-clang/llvm/cmake-build-debug/bin/3c -addcr -alltypes -base-dir=orig -output-dir=3c-orig \
  orig/Fheap.c orig/ft.c orig/item.c orig/Fsanity.c orig/graph.c orig/Fheap.h orig/Fsanity.h orig/Fstruct.h orig/graph.h orig/item.h --
/Users/mwh/checkedc/analyse-conv/src/Ptrdist/ft/orig/graph.c:65:8: warning: type specifier missing, defaults to 'int' [-Wimplicit-int]
static id = 1;
~~~~~~ ^
/Users/mwh/checkedc/analyse-conv/src/Ptrdist/ft/orig/Fsanity.h:50:19: error: unknown type name 'Heap'
int SanityCheck1(Heap * h, Item * i);
                 ^
/Users/mwh/checkedc/analyse-conv/src/Ptrdist/ft/orig/Fsanity.h:50:29: error: unknown type name 'Item'
int SanityCheck1(Heap * h, Item * i);
                           ^
/Users/mwh/checkedc/analyse-conv/src/Ptrdist/ft/orig/Fsanity.h:65:19: error: unknown type name 'Heap'
int SanityCheck2(Heap * h);
                 ^
/Users/mwh/checkedc/analyse-conv/src/Ptrdist/ft/orig/Fsanity.h:83:19: error: unknown type name 'Heap'
int SanityCheck3(Heap * h, int rank);
                 ^
/Users/mwh/checkedc/analyse-conv/src/Ptrdist/ft/orig/Fsanity.h:99:18: error: unknown type name 'Heap'
void PrettyPrint(Heap * h);
                 ^
```

I notice that the header files are being passed on the command line to 3c, so I wonder if that might be part of the reason. This behavior happens on both macOS and Linux (gamera).
In the `diff` script that we use to compare the manual port and the 3c-reverted port, we should not count type instantiations that differ only because of the use of a `typedef`. For example, in `yacr2` now, we have

```
< SCC = malloc<ulong>((channelNets + 1) * sizeof(ulong));
---
> SCC = malloc<unsigned long>((channelNets + 1) * sizeof(ulong));
```

The change should be made to the `src/filter.sh` file, which at present ignores `_Unchecked` and `#pragma`s. As we do with `_Unchecked`, the right thing is to simply remove the instantiations for `free`, `malloc`, `calloc`, and perhaps others. I.e., after filtering, the above will end up as

```
SCC = malloc((channelNets + 1) * sizeof(ulong));
```

One challenge here is that you can't just look for `<[^>]*>` as a regexp, because there could be nested angle brackets, e.g., `malloc<_Ptr<foo>>`. But maybe those don't come up often enough that we need to filter them.
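Concretely, the filter addition might look like this (a sketch; the allocator list and function name are guesses):

```shell
#!/bin/sh
# Hypothetical addition to filter.sh: drop type instantiations on allocator
# calls so malloc<ulong> and malloc<unsigned long> compare equal under diff.
# [^<>]* deliberately gives up on nested instantiations like malloc<_Ptr<foo>>,
# per the caveat above.
strip_inst() {
  sed -E 's/(malloc|calloc|realloc|free)<[^<>]*>/\1/g'
}

printf '%s\n' 'SCC = malloc<unsigned long>((channelNets + 1) * sizeof(ulong));' | strip_inst
# prints: SCC = malloc((channelNets + 1) * sizeof(ulong));
```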
The pointer counts in the table seem like they should add up. For example:

```
Program  Version  Lines Refactored  Lines Annotated  Lines Left     #ptr  #ntarr  #arr  #wild  #bounds  #casts
mst      manual   129 (39.21 %)     28 (8.41 %)      N/A              43       4     8      0        8      24
mst      revert   N/A               28 (8.41 %)      4 (1.20 %)       45       3     7      0        7       1
mst      orig     N/A               12 (3.65 %)      132 (40.12 %)     3       0     4     55        4       1
```

Notice that ptr+ntarr+arr+wild = 55 in the first two lines, but it's 3+0+4+55 = 62 for the last line. I wonder if the reason is that the first two lines count manual and 3c-revert, which have the same code structure, whereas the last counts 3c-orig, whose code might have been restructured; e.g., fewer or more pointers added, use of a custom allocator, etc. That's probably what it is, but we should understand the differences so we can explain them if they are striking.
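A quick way to spot such mismatches is to sum the four pointer-kind columns per row (a sketch; it indexes fields from the right, via `NF`, to sidestep the variable-width percentage columns on the left):

```shell
#!/bin/sh
# Hypothetical sanity check: sum #ptr+#ntarr+#arr+#wild for each table row.
# Counting fields from the right (NF) avoids the "(39.21 %)"-style columns,
# which split into a varying number of whitespace-separated fields.
awk '{ print $1, $2, $(NF-5) + $(NF-4) + $(NF-3) + $(NF-2) }' <<'EOF'
mst manual 129 (39.21 %) 28 (8.41 %) N/A 43 4 8 0 8 24
mst revert N/A 28 (8.41 %) 4 (1.20 %) 45 3 7 0 7 1
mst orig N/A 12 (3.65 %) 132 (40.12 %) 3 0 4 55 4 1
EOF
# prints: mst manual 55 / mst revert 55 / mst orig 62
```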
For the latex table `final.tex`, I would like it to be grouped by benchmark, something like:

```
BM   Program  Version  Lines Refactored  Lines Annotated  Lines Left     #ptr #ntarr #arr #wild #bounds #casts
----
olde bh       manual   121 (9.40 %)      45 (3.48 %)      N/A             135      3   54     0      51     60
     bh       revert   N/A               85 (6.57 %)      71 (5.42 %)     135      3   54     0      51     11
     bh       orig     N/A               64 (4.97 %)      134 (10.35 %)    24      1   54   110      51      2
     bisort   manual   58 (22.05 %)      35 (13.73 %)     N/A              44      2    2     0       2     10
     bisort   revert   N/A               34 (13.33 %)     1 (0.39 %)       44      2    2     0       2      1
...
----
ptrd anagram  manual   88 (26.27 %)      52 (14.44 %)     N/A              16      4   32     4      31     26
     anagram  revert   N/A               27 (7.50 %)      40 (11.17 %)     15      7   11    23      11      8
     anagram  orig     N/A               27 (8.06 %)      100 (29.85 %)    12      7   12    13      11      3
     ft       manual   147 (16.63 %)     122 (13.74 %)    N/A             169      1    2     0       2     20
     ft       revert   N/A               122 (13.74 %)    0 (0.00 %)      169      1    2     0       2      0
...
```

Also, it would be good to shade the groups, e.g., gray for `bh`, then white for `bisort`, etc., to make the groups easier to pick out.
We either need to fix the benchmarks or fix 3c bug correctcomputation/checkedc-clang#543. In particular, this fragment in `anagram.c` causes an assertion failure in `3c`:

```c
#include <stdio.h>
#define fprintf(...) { (fprintf)(__VA_ARGS__); }

void Fatal(const char * pchMsg, unsigned u) {
  fprintf(stderr, pchMsg, u);
}
```

The issue is the parentheses around the `fprintf` in the macro. If we remove them, `3c` works fine. But will removing them cause `C3` to fail? Figure this out; resolve.
We need to integrate vsftpd-port into this repo. We should not copy the files here, though. Instead, to stay up to date with that repo, I would suggest:

- `src/vsftpd`, structured to match the other project directories here (this will involve switching branches to get the `orig` files)

The current approach always expands macros. But we also want a version of the table for code without macros expanded, to stick in the paper appendix.
`final.txt` does not seem to be including `yacr2` for some reason. On gamera, things are getting built, but there's obviously some sort of error along the way.