cwebbin's Introduction

literate programming in ansi-c/c++

cwebbin is an extension of silvio levy's and donald e. knuth's cweb system and donald e. knuth's ctwill program. it requires the contents of the original cweb source drop and the secondary ctwill source drop, to which it applies a set of change files to introduce advanced features. see the extensive readme for the full story.

feature list

  • includes ctwill and its utilities;
  • internationalization with the “GNU gettext utilities”;
  • temporary file output: check output for differences from former run with new option +c;
  • [only cweave and ctwill] option -l to change the first line in the tex output; options -i and -o for slightly customizable code layout;
  • [only ctangle] output can be redirected to @(/dev/{stdout,stderr,null}@>;
  • [only in “tex live”] file lookup with the kpathsea library.

manual compilation

extract ctwill.tar.gz and add the contents of cweb-4.11.tar.gz (overwriting outdated source files Makefile, common.h, common.w, and prod.w) and cwebbin-2023.tar.gz for the full set of source files. replace @@VERSION@@ in line 129 of the Makefile.unix with something like Version 4.11 [CWEBbin 2023]. touch *.cxx. unix/linux users should work with make -f Makefile.unix exclusively (targets boot, cautiously, and all). macos/bsd users will have to adapt Makefile.unix in several spots to make things work.

advanced packaging

alternatively, you may want to use rpmbuild or debbuild for compiling the sources and for creating installable packages in rpm and deb format. set up your build arena with mkdir BUILD BUILDROOT RPMS SOURCES SPECS SRPMS for rpmbuild (plus mkdir DEBS SDEBS for debbuild).

clone cweb and cwebbin, create the source drops with

git archive -o cweb-4.11.tar.gz cweb-4.11
git archive -o cwebbin-2023.tar.gz cwebbin-2023.3

respectively, put these two tarballs and the original ctwill.tar.gz in the SOURCES directory, add the patch files

  • 0001-Support-extended-syntax-for-numeric-literals.patch
  • 0002-Purge-redundant-TeX-macro.patch
  • 0003-Adapt-to-CWEB-4.5.patch
  • 0004-Add-silent-datecontentspage-macro.patch
  • 0005-Update-CTWILL-macros-for-CWEB-4.9.patch

to SOURCES also, and place cwebbin.spec in the SPECS directory of your build arena.

the five patch files upgrade the ctwill macros for modern cweb. originally, they come from branch update-macros-for-cweb-4 and can be recreated by git format-patch master in the archived ctwill project.

depending on your preferences run the magic incantation

{deb|rpm}build -ba SPECS/cwebbin.spec

cweb for texlive

the extended sources and the build system were modified to smoothly integrate with the texlive build system. by invoking

{deb|rpm}build -bi SPECS/cwebbin.spec --with texlive

you receive a small tarball cweb-texlive.tar.gz, which should be extracted in texlive's source directory texlive-source (or the subversion equivalent) with

cd /path/to/texlive-source
pax -rzf /path/to/cweb-texlive.tar.gz

this tarball contains *-w2c.ch files that modify the original cweb sources for the texlive ecosystem. additionally, it contains language catalogs, tex macros, and cweb include files.

updated versions of cweb are added to the texlive source tree with

cd /path/to/texlive-source/texk/web2c/cwebdir
pax -rzf /path/to/cweb-4.11.tar.gz

cwebbin's People

Contributors

ascherer

cwebbin's Issues

Where to install POT/PO/MO in TeX Live?

Example from TeX Live 2018: texmf-dist/tex/generic/babel/locale/{de,et,al}

Standard (Ubuntu 16.04.5 LTS): /usr/share/locale/{de,et,al}/LC_MESSAGES/

Idea: texmf-var/web2c/cweb/locale/{de,et,al}/LC_MESSAGES/

Lookup: $ kpsewhich -var-value=TEXMF{ROOT,MAIN,DIST,VAR,HOME}

Self lookup (from kpathsea): kpse_selfdir(argv0);

Whatis entry in manual page of ctwill / cweb

This is a question rather than a bug report. I just built the Debian package providing the Debian binaries. For the manual pages of ctwill & cweb, the package checker complains:

W: bad-whatis-entry usr/share/man/man1/ctwill.1.gz
W: bad-whatis-entry usr/share/man/man1/cweb.1.gz

Indeed the entry behind .SH NAME is empty or separated from the expected content by .PP. The manual pages are generated by pandoc, hence it could be a bug in pandoc, not generating the manual page correctly. Could you have a short look if it is an issue in the md file or in pandoc? Thanks!

Can/will/should there be support for future ISO C language features?

Occasionally I like to check the publicly-available documents of ISO Working Group 14
to see what's new in the world of C. There you can find various proposals and draft
standards, the most recent one being N2596. Most of the changes
are in semantics or in the standard library, but I was surprised to find that the syntax
of the language has been extended in two significant ways.

Even though this is a draft standard, it is a close approximation to what the next
official one will be. I'll use the present tense to be more understandable.

First, C now has binary integer literals, written with a 0b (or 0B) prefix, as in
0b101010. They were proposed in document N2549. This syntax is also present in C++.
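As a rough illustration of how little scanning logic the 0b prefix needs, here is a minimal sketch in C. The function name scan_binary_literal is hypothetical and is not part of CWEB; integer suffixes, digit separators, and overflow are deliberately ignored.

```c
#include <stddef.h>

/* Hypothetical helper, not taken from CWEB: if |s| starts with a binary
   integer literal (0b or 0B followed by at least one binary digit), store
   its value in |*value| and return the number of characters consumed;
   otherwise return 0 and leave |*value| untouched. */
size_t scan_binary_literal(const char *s, unsigned long *value)
{
  size_t i = 2;
  unsigned long v = 0;

  if (s[0] != '0' || (s[1] != 'b' && s[1] != 'B')) return 0;
  if (s[2] != '0' && s[2] != '1') return 0; /* prefix without digits */
  while (s[i] == '0' || s[i] == '1') {
    v = (v << 1) | (unsigned long)(s[i] - '0');
    i++;
  }
  *value = v;
  return i;
}
```

Called on the example from the text, scan_binary_literal("0b101010", &v) consumes eight characters and sets v to 42.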

The more interesting change was in the introduction of “attributes”, based on a
recently-added C++ feature. They were proposed in document N2335. Attributes have the
general form of an identifier surrounded by double square brackets, as in [[whatever]].
Sometimes they can have a parenthesized argument after the identifier. They behave
somewhat like GCC's __attribute__, but of course they are standardized. Here is an
example of an attribute, based on paragraph 6.7.11.3.5 of the draft linked above:

int f(int i) {
      [[maybe_unused]] int j = i + 100;
      assert(j);
      return i;
}

The code declares that variable j might be unused; the implementation should not produce
a warning if it isn't. (It will be unused if NDEBUG is defined, so that assert expands
to ((void)0).) Thus it's basically like Common Lisp's (declare (ignore ...)).

The standard attributes currently defined are nodiscard (discourage use of value in void
expression), deprecated (indicate that something isn't to be used), fallthrough
(indicate that control in a switch statement is intended to fall through from one case
to another), and maybe_unused (indicate that something may not be used).

Unfortunately for those who want to cross-reference C code, attributes can appear in
troubling places. Here is the standard's example for the deprecated attribute
(6.7.11.4.7):

struct [[deprecated]] S {
      int a;
};

enum [[deprecated]] E1 {
      one
};

enum E2 {
      two [[deprecated("use 'three' instead")]],
      three
};

[[deprecated]] typedef int Foo;

void f1(struct S s) { // Diagnose use of S
      int i = one; // Diagnose use of E1
      int j = two; // Diagnose use of two: "use 'three' instead"
      int k = three;
      Foo f; // Diagnose use of Foo
}

[[deprecated]] void f2(struct S s) {
      int i = one;
      int j = two;
      int k = three;
      Foo f;
}

struct [[deprecated]] T {
      Foo f;
      struct S s;
};

The attribute appears before a declaration (Foo, f2), in the middle of a declaration
(S, E1, T), and after an enumeration constant in an enum declaration (E2). Note
that, syntactically, the grammar of the standard says that the attributes are actually a
part of the declaration.

Implementing binary literals in CWEAVE and CTANGLE is trivial, although the default
formatting of such constants would have to be decided upon. I don't think any changes to
CTANGLE are necessary to implement attributes. CWEAVE is another story. Getting them
to format correctly probably wouldn't be a huge challenge, but I believe there may be
problems with the indexing code. Specifically, when an identifier is to be underlined or
made reserved, find_first_ident is called; this function would find the attribute and
return it instead of the intended token.

One fix would be to make attributes a special case and “give up”, as CWEAVE does with
C++'s operator. Or they could be ignored in the search. Either way, the existing logic
could produce anomalous results.

I am not requesting that these changes be made, but rather asking whether these changes,
and any similar ones, should be made. (If so, then I would be happy to write the code
myself.) Maybe it is desired to keep the syntax to C99. Adding binary literals and
attributes would have the plus side of improving support for C++, if that means anything.
Other syntactic elements of (published recent standards of) C and C++ that CWEB doesn't
handle are the use of “universal character names” (e.g., x\U3401y) and string/character
constant encoding prefixes (e.g., U8"whatever"). Logically, I think, the prefix should
be written in typewriter style along with the rest of the string.

Thank you for your hard work in making this wonderful program more accessible.

Use C++ containers and strings.

Given that CWEBbin is compiled as C++ code anyway, it would be very advantageous to replace the complicated memory management in CWEB with standard containers like std::vector and std::string. This could also replace the *-memory.ch files.

Support UTF-8 identifiers in ctangle output.

Feature request and patch by @igor-liferenko.

Although I can apply the patch to ctangle.w and get the desired effect with the +u option, gcc 9.3.0 on (K)Ubuntu 20.04 LTS can't cope with identifiers with UTF-8 characters. It appears that UTF-8 support comes with gcc 10. Plus, there's a significant amount of spit and polish to be applied in order to integrate that small patch in the code base (test, doc, etc.).

Support XeTeX.

Use some “language” -lp+ to switch on PDF creation.

Let cweave react to option '-i' with function prototypes

By default, cweave should produce

int function(int a,@t\1\1@> /* comment for parameter |a| */
   int b, /* comment for |b| */
   int c@t\2\2@>) /* comment for |c| */
{
   /* body */
}

without the need for the extra @t material @>.

With option -i, the in- and out-denting of the parameters should be suppressed.

CWEBBIN doesn't acknowledge '-' for the change file name

Switching typedef boolean from (signed) short to bool introduced a type mismatch, or rather a miscalculation, for the value found_change=-1, which in itself is a bit weird for a two-valued Boolean type.

It appears that GCC's bool type knows only value 0 as false and "all other values" equal true.
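The effect can be reproduced in isolation. The sketch below uses hypothetical helper names (they are not from the CWEB sources) to show that a value of -1 no longer compares equal to itself once it has been stored in a bool, whereas it survives in a short. The C standard itself defines conversion to bool as "0 stays 0, everything else becomes 1", so the behaviour is not specific to GCC.

```c
#include <stdbool.h>

/* Hypothetical helpers for illustration only. Each stores |value| in the
   given type, reads it back, and reports whether it is still the same. */
int survives_as_bool(int value)
{
  bool stored = value; /* any nonzero value is converted to 1 */
  int back = stored;   /* reads back as 0 or 1 */
  return back == value;
}

int survives_as_short(int value)
{
  short stored = (short)value; /* -1 fits and is kept as -1 */
  return stored == value;
}
```

With value set to -1, survives_as_bool returns 0 and survives_as_short returns 1, which matches the mismatch described for found_change.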

Purge redundant #includes in TeX Live CWEB.

diff --git a/texk/web2c/cwebdir/comm-w2c.ch b/texk/web2c/cwebdir/comm-w2c.ch
index 7087e7b04..ec8748525 100644
--- a/texk/web2c/cwebdir/comm-w2c.ch
+++ b/texk/web2c/cwebdir/comm-w2c.ch
@@ -81,6 +81,15 @@ common_init(void)
   @<Scan arguments and open output files@>@;
 @z
 
+@x l.100
+\.{ctype.h} header file.
+
+@<Include files@>=
+#include <ctype.h>
+@y
+\.{ctype.h} header file, included through the \Kpathsea/ interface.
+@z
+
 @x
 @d not_eq 032 /* `\.{!=}'\,;  corresponds to MIT's {\tentex\char'32} */
 @y
@@ -113,6 +122,13 @@ char *id_loc; /* just after the current identifier in the buffer */
 @d xisupper(c) (isupper((eight_bits)c)&&((eight_bits)c<0200))
 @z
 
+@x
+@ @<Include files@>=
+#include <stdio.h>
+@y
+@ Most of the standard \CEE/ interface comes from \Kpathsea/.
+@z
+
 @x
 int input_ln(fp) /* copies a line into |buffer| or returns 0 */
 FILE *fp; /* what file to read from */
@@ -1129,11 +1145,8 @@ extern char* strncpy(); /* copy up to $n$ string characters */
 @y
 @ For string handling we include the {\mc ANSI C} system header file instead
 of predeclaring the standard system functions |strlen|, |strcmp|, |strcpy|,
-|strncmp|, and |strncpy|.
+|strncmp|, and |strncpy|; this comes through the \Kpathsea/ interface.
 @^system dependencies@>
-
-@<Include...@>=
-#include <string.h>
 @z
 
 @x
@@ -1181,7 +1194,6 @@ standard C types for boolean values, pointers, and objects with fixed sizes.
 @<Include files@>=
 #include <stdbool.h> /* type definition of |bool| */
 #include <stddef.h> /* type definition of |ptrdiff_t| */
-#include <stdint.h> /* type definition of |uint8_t| et al. */
 
 @ The |scan_args| and |cb_show_banner| routines and the |bindtextdomain|
 argument string need a few extra variables.

Add option '-o' to ctangle.

Restore original behaviour of ctangle by allowing unconditional -overwriting of the output file(s). (This could also be -force.)

Both options are already used in cweave.

Rephrase option '-t' as '-c'.

nuweb uses '-c' all along: "Avoid testing output files for change before updating them."

'-t' will be used by upstream cweb.

Redact CTWILL.

  • Split long sections by adding @ .
  • Move broken sections to next page with @r.
  • Fix mini-indexes.

Update branch 'cweb-ansi'

There has been quite some activity in the past half year that has not been reflected in the cweb-ansi branch. It may prove useful to have a setting for compiling "only" the "plain CWEB" sources with a modern C compiler.

Write a new Non-ASCII-to-TeX conversion file

We need a new utf8.sty file along the lines of ecma94.sty. Also a new utf8.w transliteration table similar to ecma94.w might be useful – this will require an extended @l directive for multi-byte encoded characters.

Unicode

There are a couple open issues about UTF-8 and Unicode. I was going to write this as a comment on one of them, but I wanted to make a new issue to address Unicode support in general.

(I'm happy to begin working on Unicode implementation, as soon as the issues mentioned below are discussed.)


I have been contemplating what it would take to integrate Unicode into CWEB.

There are several things to consider. I am assuming that UTF-8 is the only input/output encoding that need be supported.

What should the internal representation of characters be?

  • Keeping them in UTF-8 form is attractive because the code can continue using char without fear; however, at some point a certain amount of decoding is required. The full extent depends on how much error checking we want to do and on the preferred action of CTANGLE. As an aesthetic choice, eight_bits or a new, synonymous type octet could be substituted for char when the value is an octet of UTF-8 input.
  • UTF-16 is, I think you'll agree, a silly choice. To adopt it would have no benefits that I can see over UTF-32, other than that it takes less space.
  • Decoding the input fully into UTF-32 form, storing every character's full code point, is a viable strategy. One advantage is that encoding/decoding code can be separated from the parts of the programs that work with characters in memory. Unfortunately, all code would have to be modified to work with uint_fast32_t or whatever (probably hidden behind a code_point typedef) instead of char. The other major issue is that ASCII characters, which constitute the majority of typical C text, unconditionally occupy four times more storage than is necessary. But this isn't the greatest concern nowadays. It is convenient that every character takes up a single value.

The programs often advance to the next character in a string by incrementing a pointer by 1. If UTF-8 is chosen as the internal representation, then all such increments will have to be adjusted to compensate. Using UTF-32 would avoid this problem.
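To make the cost of the UTF-8 option concrete, here is a minimal sketch of what would replace a plain pointer increment. The names utf8_seq_len and utf8_next are hypothetical and do not exist in CWEB; the constants are written in octal to match CWEB's habit.

```c
#include <stddef.h>

/* Hypothetical helper: length in octets of the UTF-8 sequence introduced
   by lead octet |c|, or 0 if |c| cannot start a sequence. */
size_t utf8_seq_len(unsigned char c)
{
  if (c < 0200) return 1;               /* 0xxxxxxx: ASCII */
  if (c >= 0302 && c <= 0337) return 2; /* 110xxxxx, overlong leads excluded */
  if (c >= 0340 && c <= 0357) return 3; /* 1110xxxx */
  if (c >= 0360 && c <= 0364) return 4; /* 11110xxx, up to U+10FFFF */
  return 0;                             /* continuation or invalid octet */
}

/* Hypothetical replacement for |p++|: return a pointer to the start of the
   next character. Malformed or truncated sequences are skipped without
   reading past a terminating NUL; the caller checks for the end of the
   string before calling. */
const char *utf8_next(const char *p)
{
  size_t n = utf8_seq_len((unsigned char)*p);
  size_t i;

  if (n == 0) return p + 1;
  for (i = 1; i < n; i++)
    if (((unsigned char)p[i] & 0300) != 0200) return p + i;
  return p + n;
}
```

For the string "a\303\274b" (the letters a, ü, b), utf8_next steps over one octet, then two, then one.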

In summary, storing characters in UTF-32 form takes up more space, forces encoding/decoding, and requires altering most declarations related to characters; storing characters in UTF-8 saves space and allows declarations to remain unchanged, but most operations on characters would have to be changed.

Encoding or decoding could happen at the following points:

  • When storing names for sorting (see the heading “Collation” below).
  • When CTANGLE is reading @''. We probably want to extend the notation so that it “expands” into the ordinal value of any single character in the string, provided that that character corresponds to one code point. (Thus no notice is taken of combining characters.)
  • When CTANGLE is converting names for output, if it must transliterate (see the heading “Transliteration” below).
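For the decoding points listed above, a hand-written decoder is short. The following is a sketch under stated assumptions: the name utf8_decode is hypothetical, the input is NUL-terminated, and error handling is reduced to returning 0 for malformed input (overlong forms, surrogates, and values above U+10FFFF are rejected).

```c
#include <stdint.h>

/* Hypothetical helper: decode the UTF-8 sequence at the start of the
   NUL-terminated string |s| into |*cp|. Returns the number of octets
   consumed, or 0 if the sequence is malformed. An empty string decodes
   as the single code point 0. */
int utf8_decode(const char *s, uint32_t *cp)
{
  const unsigned char *u = (const unsigned char *)s;
  uint32_t c;
  int n, i;

  if (u[0] < 0x80) { *cp = u[0]; return 1; }
  else if ((u[0] & 0xE0) == 0xC0) { n = 2; c = u[0] & 0x1F; }
  else if ((u[0] & 0xF0) == 0xE0) { n = 3; c = u[0] & 0x0F; }
  else if ((u[0] & 0xF8) == 0xF0) { n = 4; c = u[0] & 0x07; }
  else return 0; /* stray continuation octet or invalid lead */

  for (i = 1; i < n; i++) {
    if ((u[i] & 0xC0) != 0x80) return 0; /* also stops at NUL */
    c = (c << 6) | (uint32_t)(u[i] & 0x3F);
  }
  if ((n == 2 && c < 0x80) || (n == 3 && c < 0x800) || (n == 4 && c < 0x10000))
    return 0; /* overlong encoding */
  if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
    return 0; /* out of range or surrogate */
  *cp = c;
  return n;
}
```

An extended @'...' could then accept its argument only if utf8_decode consumes the whole string between the quotes, which matches the "exactly one code point" rule suggested above.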

It might be easier to do encoding/decoding manually, not by trying to use any of C's “wide character” facilities. (Frankly, I find them obnoxious. Also, many uses of C input/output functions would have to be changed.)

One good thing about UTF-8 is that it is quite naturally expressed in octal, so CWEB's preference could be maintained through the transition.
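The fit with octal comes from the fact that every continuation octet carries six payload bits, that is, exactly two octal digits, behind the fixed prefix 0200. A minimal encoder sketch written entirely with octal constants might look as follows; the name utf8_encode is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: encode code point |cp| as UTF-8 into |out|.
   Returns the number of octets written, or 0 if |cp| is a surrogate or
   lies beyond U+10FFFF (04177777 in octal). */
int utf8_encode(uint32_t cp, unsigned char out[4])
{
  if (cp < 0200) { /* one octet: 0xxxxxxx */
    out[0] = (unsigned char)cp;
    return 1;
  }
  if (cp < 04000) { /* two octets, below U+0800 */
    out[0] = (unsigned char)(0300 | (cp >> 6));
    out[1] = (unsigned char)(0200 | (cp & 077));
    return 2;
  }
  if (cp < 0200000) { /* three octets, below U+10000 */
    if (cp >= 0154000 && cp <= 0157777) return 0; /* U+D800..U+DFFF */
    out[0] = (unsigned char)(0340 | (cp >> 12));
    out[1] = (unsigned char)(0200 | ((cp >> 6) & 077));
    out[2] = (unsigned char)(0200 | (cp & 077));
    return 3;
  }
  if (cp <= 04177777) { /* four octets, up to U+10FFFF */
    out[0] = (unsigned char)(0360 | (cp >> 18));
    out[1] = (unsigned char)(0200 | ((cp >> 12) & 077));
    out[2] = (unsigned char)(0200 | ((cp >> 6) & 077));
    out[3] = (unsigned char)(0200 | (cp & 077));
    return 4;
  }
  return 0;
}
```

For example, U+00FC (octal 0374) splits into the digit groups 3 and 74, giving the octets 0303 and 0274.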

Unicode character data

In any case, the hardest part about supporting Unicode beyond simple encoding and decoding is dealing with the Unicode character database. Unicode 13.0 assigns (gives meaning to) 143 859 out of 1 114 112 possible code points. Every character has many properties that describe it.

Unicode distributes a bunch of plain text files that contain the property data for all characters. Unfortunately, there is no file that consolidates all information into one place, except for the Unicode XML database.

I'm going to ignore the task of reading the data in for now. The more interesting problem is this: How do we store information about every character? A full implementation of Unicode would be forced to have a way to get the value of any property, but CWEB needs only a limited set.

Width.

CWEB's error reporting routine indicates the current position in the buffer by printing it out like this:

first part of line
                  second, unread part of line

The problem is that the code assumes that all characters occupy the same amount of horizontal space. In reality, some characters have no width, some are wider than one column, etc. The amount of effort it would take to get this correct probably far outweighs the utility of the feature. But it's certainly possible; GCC handles cursor position in Unicode input just fine.

Transliteration.

For CTANGLE, we must be able to associate some string of text with a character, defining its transliteration. All that's needed is a char *.

C99 and C++98 added a syntactic feature called a “universal character name”, which is basically a four- or eight-digit hexadecimal character code embedded in regular source text. For example, a\u200Bb gives you ab, where the two characters are separated by a zero-width space. According to Annex D of the C standard and lex.name.allowed in the C++ standard, this is a perfectly valid identifier. However, both languages prohibit many characters from appearing as universal character names in identifiers. It is tempting to change CTANGLE's default transliteration to insert an equivalent universal character name, but the restrictions complicate matters.
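A default transliteration of this kind would only need to format the code point. The sketch below uses a hypothetical function name, format_ucn, and deliberately does not check whether the resulting universal character name is permitted in an identifier; that check is exactly the complication mentioned above.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: write the universal character name for |cp| into
   |buf| as \uXXXX (code points up to U+FFFF) or \UXXXXXXXX (all others).
   Returns the number of characters written, or -1 if |cp| is out of range
   or the buffer is too small. No identifier-validity check is performed. */
int format_ucn(uint32_t cp, char *buf, size_t size)
{
  int n;

  if (cp > 0x10FFFF) return -1;
  if (cp <= 0xFFFF)
    n = snprintf(buf, size, "\\u%04lX", (unsigned long)cp);
  else
    n = snprintf(buf, size, "\\U%08lX", (unsigned long)cp);
  return (n < 0 || (size_t)n >= size) ? -1 : n;
}
```

For the zero-width space used in the example, format_ucn(0x200B, buf, sizeof buf) produces the six characters \u200B.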

Normalization.

Some strings of Unicode characters are effectively identical while not being exactly (i.e., numerically) equal. For example, a precomposed character like “ü” (U+00FC LATIN SMALL LETTER U WITH DIAERESIS) should usually be treated identically to its decomposed counterpart “ü” (U+0075 LATIN SMALL LETTER U and U+0308 COMBINING DIAERESIS).

Therefore Unicode defines (in UAX 15) a process of normalization, which converts strings to a canonical form. There are a few kinds of normalization, depending on whether you want to tend towards decomposing characters or towards composing characters and how you want to handle compatibility characters.

Several properties are associated with normalization, including Canonical_Combining_Class (a nonnegative integer below 256), Decomposition_Type (one of sixteen values), and Decomposition_Mapping (a string of at most eighteen code points).

It would probably be best for CWEB to normalize all strings before entering them into the character/byte memory.

Identifiers.

If we want “extended characters” to be allowed in identifiers, we need to know exactly which code points can begin an identifier and which code points can continue an identifier. Luckily there are properties just for this, thanks to UAX 31. Specifically, if a character has the property XID_Start, it can begin an identifier, and if a character has the property XID_Continue, it can be a part of an identifier.

(There are also ID_Start and ID_Continue. The X variants are for normalized text only.)

Collation.

Here's the big one. The entirety of CWEAVE's Phase III is devoted to sorting and outputting an index. Sorting the index involves putting names in order, according to a collating sequence; in the current version of CWEAVE, the collation is represented by the collate array. Unicode collation is much more complex, due to the expanded character set.

Full details of the Unicode collation algorithm can be found in UTS 10. It is based on four levels of comparison between strings. The specification requires that strings be normalized before comparison.

Collation needs a collation element table to work. The Default Unicode Collation Table (DUCET) can be found here; like the rest of the Unicode data, it is stored in a plain text file. In the DUCET, only three of the four levels of comparison are used, in order to allow implementations to extend the order for whatever internal reason. Other collation element tables exist for specific languages or conventions.

Storing the data.

In general, we want a way to map a twenty-one-bit number (probably held in a thirty-two-bit integer) to some data structure containing the character properties we are interested in. Storing all the needed information straightforwardly in a statically-allocated array would occupy about 45 megabytes on a sixty-four-bit system. I'm counting

  1. The transliteration (char *)
  2. Canonical combining class (uint8_t)
  3. Decomposition type (short)
  4. Decomposition mapping (char * or code_point * depending on the internal
    representation of characters)
  5. XID start (bool)
  6. XID continue (bool)
  7. Collation element (struct { uint16_t a, b, c, d; })

We would have the transliteration string be NULL if no transliteration was given; then CTANGLE would compute it automatically.

I think that more attributes must be stored for normalization, so 45 megabytes is really a lower bound.
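The estimate can be checked with a short sketch. The type name char_props and the field names are hypothetical; the fields follow the seven items listed above, with char * assumed for the decomposition mapping. On a typical sixty-four-bit system with eight-byte pointers, alignment padding brings the structure to 40 bytes, and 1 114 112 entries of that size come to 44 564 480 bytes.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CODE_POINT_COUNT 1114112UL /* U+0000 through U+10FFFF */

/* Hypothetical per-character record, one field per item in the list. */
typedef struct {
  char *translit;          /* transliteration, NULL if none was given */
  uint8_t combining_class; /* Canonical_Combining_Class, 0..255 */
  short decomp_type;       /* Decomposition_Type */
  char *decomp_mapping;    /* Decomposition_Mapping (UTF-8 assumed here) */
  bool xid_start;          /* may begin an identifier */
  bool xid_continue;       /* may continue an identifier */
  struct { uint16_t a, b, c, d; } collation; /* collation element */
} char_props;

/* Size in bytes of a flat table holding one entry per code point. */
unsigned long table_bytes(size_t entry_size)
{
  return CODE_POINT_COUNT * (unsigned long)entry_size;
}
```

Calling table_bytes(sizeof(char_props)) reports the figure for the platform at hand; the exact value of sizeof depends on pointer size and padding rules.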

There are many ways of compressing this, of course. Full Unicode implementations typically use a kind of trie for looking up properties, because the entire set of properties for a single character takes up a lot of space. Compression is also possible because long runs of characters tend to share properties.
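One simple member of that family is a two-stage table: the first stage maps each block of 256 code points to a block number, and identical blocks are stored only once. The sketch below is a toy, not real Unicode data. All names are hypothetical, the only property stored is "is an ASCII letter", and the initialization assumes an ASCII execution character set.

```c
#include <stdint.h>

#define BLOCK_BITS 8
#define BLOCK_SIZE (1 << BLOCK_BITS)         /* 256 code points per block */
#define BLOCK_COUNT (0x110000 >> BLOCK_BITS) /* 4352 blocks in total */

/* Stage 1: block number for each run of 256 code points (all 0 at start). */
static uint8_t stage1[BLOCK_COUNT];
/* Stage 2: block 0 is the shared all-zero block, block 1 covers U+0000..U+00FF. */
static uint8_t stage2[2][BLOCK_SIZE];

/* Fill in the toy property: 1 for ASCII letters, 0 for everything else. */
void init_toy_table(void)
{
  int c;

  stage1[0] = 1; /* only the first block has non-default content */
  for (c = 'A'; c <= 'Z'; c++) stage2[1][c] = 1;
  for (c = 'a'; c <= 'z'; c++) stage2[1][c] = 1;
}

/* Two array accesses replace one access into a 1 114 112-entry table. */
uint8_t lookup(uint32_t cp)
{
  if (cp >= 0x110000) return 0;
  return stage2[stage1[cp >> BLOCK_BITS]][cp & (BLOCK_SIZE - 1)];
}
```

In this toy the tables occupy 4352 + 512 bytes instead of 1 114 112, because every block except the first shares the all-zero block; real property data would need more distinct blocks but would benefit from the same sharing.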

Since CWEAVE doesn't do transliteration, and since CTANGLE doesn't do collation, the two areas of storage could be put into a union.

Actually getting the data.

I glossed over this earlier, but it's important. How can CWEB read the character information into memory? There is far too much to compile directly into the programs; should it be read at initialization? Ideally we could do what TeX does and save the program's state after initialization, but I'm not sure if there is a good, portable way.

The property information we want is found in the files UnicodeData.txt, DerivedCoreProperties.txt, allkeys.txt, and DerivedNormalizationProps.txt. Thus if CWEAVE or CTANGLE are starting up from scratch, they must read in four very large files.

Alternatively, we could write a program to extract only the relevant data from the relevant files and write it in an especially compact form to a new file, which would be read by CWEAVE and CTANGLE. I think that the most recent version of such a file should be distributed with CWEB, but I can certainly see arguments to the contrary.

[The program could be a more general utility (serving as another example of CWEB) that creates a compressed file containing a specified set of properties for each character. For instance, you might want to know only the names and aliases of characters; you can open the program, enter “name,alias”, and it would output a file accordingly.]


Or use a library.

I'm against this option. One of CWEB's appeals is that it is very easy to set up. It has no dependencies except on the C standard library; all you need is a C compiler to run CWEB. Existing Unicode implementations are bulky and annoying, and they wouldn't fit in with the rest of CWEB.

Flow for building manual with xdvipdfmx breaks

If the build is set to use dvipdfm (the default) instead of pdftex, the build stage for fullmanual will fail. It will run dvipdfm ctangle, which will try to load the ctangle executable in the current directory, instead of ctangle.dvi.

(Note: In Debian-based distros, the default version of dvipdfm is xdvipdfmx; I don't know if the following is a problem under stock dvipdfm.)

ctangle -> ctangle.pdf
DVI ID = 0

xdvipdfmx:fatal: Something is wrong. Are you sure this is a DVI file?

Output file removed.
Makefile.unix:212: recipe for target 'ctangle.pdf' failed
make[1]: Leaving directory '/tmp/cwebbin-cwebbin-22p'
make[1]: *** [ctangle.pdf] Error 1
make: *** [fullmanual] Error 2
Makefile.unix:309: recipe for target 'fullmanual' failed
The command '/bin/sh -c make -f Makefile.unix all' returned a non-zero code: 2

The repro I used for this is in https://gist.github.com/piquan/937fbc682090838870210af996102e85 , along with a patch for both this issue and #9 (which shadows this issue in that repro). To reproduce the dvipdfm issue being described here, delete the part of cwebbin.diff starting from @@ -210,7 +210,7 @@ onwards.

Bootstrapping issue: cannot find @i files during initial compile

While doing an initial build of cwebbin, I ran into a problem: while it's running ctangle, it looks for the @i files in the installed location, not for the ones in the build tree. That means that if cwebbin isn't yet installed, it can't build itself.

./ctangle +s ctangle ctang-22p.ch ctangle.cxx
This is CTANGLE (Version 3.64 [22p])
*1
! Cannot open include file. (l. 26 of include file comm-22p.h)
@i iso_types.w

An example Dockerfile showing the workflow that led up to this is in https://gist.github.com/piquan/937fbc682090838870210af996102e85 . There's also a patch in that gist that fixes both this issue and #10.

Complete web2c-help.pot

Add all msgids from texk/web2c/help.h. Installed at a central place in TeX Live, this would permit all web2c sub-programs – if translated with “GNU gettext utilities” – to work with a common translation pack.

ctwill and table of contents ToC

ctwill does not produce a ToC, even with the option +x. According to the man documentation,

       • -x: omit indices, section names, table of contents

Running ctwill-twinx attempt.tex > index.tex creates a master index, but there is no table of contents.

Is this by design?
