Open source projects tipically follow a set of coding standard guidelines the involves cosmetical and technical guidelines that must be followed. The current VM does not have a proper coding standard, so we need to decided on something to adopt that we and new contributors must follow. Since fixing the codebase to follow a proper coding standard may involve a lot of work, the coding standard resulting from this discussion must be adopted at least for all the new code that we write, and for all the new public C interfaces that we expose to the public. The existent
For starting this discussion, here I am proposing a coding standard document with some specific details that needs to be discussed. The result of this discussion should be hosted at least in the GitHub wiki of this project. Unfortunately, GitHub does not support pull request of wiki pages, so we have to conduct this discussion manually.
VM Coding Style Guidelines
We need to define a coding convention standard for the new VM code.
The results of this discussion should be located on the wiki or in a new version of the CONTRIBUTING.md document.
Conding style guidelines include cosmetic, functional and quality aspects.
From a quality standpoint what is really important on this aspect is to be as
consistent as possible on the codebase (i.e. use the same coding style
everywhere).
I suggest keeping the following original guidelines from the original
VM source code, but this is clearly not enough:
Original VM coding guidelines from CONTRIBUTING.md
C Source Code Formatting
When editing an existing file, please be polite and first follow the
guidelines below, and secondly follow the general rule of keeping to the
same formatting conventions as exist in the file.
C Function Declarations
C function declarations should have the type on the same line as the
function name and parameters, e.g.:
static int convertCopy(char *from, int fromLen, char *to, int toLen, int term)
C Function Definitions
C function definitions should have the type on one line and the
function name and parameters on the following line, e.g.:
static int
convertCopy(char *from, int fromLen, char *to, int toLen, int term)
This facilitates searching for function definitions by searching for the
name at the beginning of the line.
Tab stops are 4 spaces
In this case the first rule is to keep the existing formatting within a
file - we don't want to polute the code history with space changes.
The goal is to use tabs set to 4 spaces.
Additional coding guidelines
Checking existing coding guidelines
The following are some popular coding style that used in the industry
and open source/free software projects:
We should take and adapt some of the guidelines that are used by these
coding standards.
Indentation style
A coding standard is not the same as an indentation and formatting
standard. A coding standard may have some .
For a list of indentation styles check:
https://en.wikipedia.org/wiki/Indentation_style . For my own projects
I typically use the Allman "BSD style" from that list. Another popular
indentation style is to use a variant from the Kernighan and Ritchie style
from that list. The K&R styles are more similar to the indentation style
that is typically used in Java.
An indentation style must be selected explicitly for the whole VM
source code project (excluding third party libraries). The selection
of this indentation style must be specified with the command line
parameters that have to be passed to the indent Unix utility.
New code should be written by following that indentation style,
and existing code should be converted by running the indent
utility on them. The indentation style typically can be specified on
IDEs such as Eclipse, and text editors such as Emacs so that they
automatically help on writing code following the indentation.
Specific guidelines
Single line comments
Single line comments using //
were not a part of the C standard until C99.
These comments are supported by the important compilers (gcc, clang, msvc).
The advantage of using these comment is that they facilitate using /**/
for
comment large blocks of code for debugging purposes. These same large blocks
of code can also be commented by #if 0 ... #endif
. Since these
comments are supported by the important compilers, then there is no reason for
not using them.
Include guards and #pragma once
The #pragma once
preprocessor directive is not part of the C standard.
However, mainstream C compilers such as gcc, clang and msvc do
support this pragma. The C standard does also mandates that
a #pragma
that is not understood by a compiler must be ignore.
The intention of this pragma is to improve compilation
speed, but because of its non-standard nature it should be used in
conjunction with include guards, in the following way:
#ifndef HEADERFOLDER_NAME_H_
#define HEADERFOLDER_NAME_H_
#pragma once
/**
Here goes the C header content.
*/
#endif //HEADERFOLDER_NAME_H_
Cast from void* to specific type
In C, casting implicitly from void* is an allowed operation, but this
is not allowed in C++. For example, the following is valid code in C,
but not in C++:
int *array = malloc(sizeof(int)*8);
The following version of the same code is valid in C and in C++:
int *array = (int*)malloc(sizeof(int)*8);
Since the second version is more explicit, I recommend that
we use that version just for the sake of being more explicit.
Order of includes in headers
According to the Google C++ coding standard, header files should be self contained and they should compile by it self. To facilitate this, the order of the includes should be the following:
- Project specific headers should come first.
- System provided headers should come after.
The intention of this ordering is to produce a compilation error
if a project specific header is missing the inclusion of a dependency.
Standard headers that are always present
The C language has the following two compilation modes:
- Free standing mode (-ffreestanding): this mode is typically used for embedded system or for constructing an operating system kernel.
- Hosted mode: the default mode with an operating system.
The following headers are mandated by the language standard to
be provided by the compiler on free standing mode, so it can be
assumed that they are always going to be present even if there is
no operating system:
- <float.h> (C99)
- <iso646.h> (C99)
- <limits.h> (C99)
- <stdalign.h> (C11)
- <stdarg.h> (C99)
- <stdbool.h> (C99)
- <stddef.h> (C99)
- <stdint.h> (C99)
- <stdnoreturn.h> (C11)
- <stdatomic.h> (C11)
From this list, the most important are <stddef.h>, <stdarg.h> and <stdint.h>.
stddef defines ptrdiff_t, size_t, wchar_t, NULL and offsetof. stdarg
defines macros for implementing functions with variable arguments such as
printf. stdint defines integer types with fixed size such as int32_t and
uint32_t. Since these headers are standard and they have to be always
provided by the compiler, then there is no reason to not use them.
Clang and GCC go for full C11 compatibility. MSVC implements only the
subsets of C99 and C11 that it also needs to implement a corresponding C++
standard (https://herbsutter.com/2012/05/03/reader-qa-what-about-vc-and-c99/).
Prefixes for namespacing
The lack of namespaces in C, and the usage of a single global symbol
table (specially on Linux. DLLs on Windows have their own symbol table.
OS X uses something similar to RTLD_DEEPBIND by default) has a huge potential
for producing name clashes. New code for the VM should use a prefix for
simulating namespaces in C which is a standard practice in C libraries.
This prefix could be realized by separating with an underscore:
int
pharovm_parseArgument(int argc, const char **argv)
{
//...
return 0;
}
Or by juxtaposition by using CamelCase:
int
pharoVMParseArgument(int argc, const char **argv)
{
//...
return 0;
}
I suggest using the version with underscores because it allows to have multiple namespaces, and we could use a different one for VM internals that should not be exposed to the public:
int
pharovm_internal_interpreter_pushOop(sqInt oop)
{
//...
}
We should decide what is this prefix going to be. The following are
some suggestions:
- pharo
- pharovm
- osvm (OpenSmalltalkVM)
Memory management hygiene
The explicit memory management in C is the source of several reliability
and security issues. For this reason we need some guidelines
when allocating memory, working with them, and disposing it.
alloca should not be used
alloca is a built-in primitive from the C compiler for allocating memory
on the stack. The implementation of alloca is done by decrementing the
stack pointer. On Linux only the memory for the stack of the main thread
can dynamically grow. The memory for threads different to the main
thread is allocated on thread creation time and it is fixed for
the whole lifetime of the thread. Since alloca can be used with large
sizes, there is no guarantee on having enough memory on the stack
for fullfilling the alloca, there is a strong potential for
provocating a stack underflow and an almost guaranteed segmentation fault
or memory corruption. For this reason the usage alloca is typically
forbidden by coding standards used in the industry.
In case where alloca is absolutely required, the size of the allocation
should be really small (no more than 100s of bytes). A local array with
the maximum size should always be preferred to alloca.
alloca should not be used for MAX_PATH for the same reason that a local
variable should not be used for MAX_PATH. The two idioms are equivalent.
malloced memory should be always fully inititialized
malloc returns uninitialized memory which can contain random garbage, but in
many cases it may look like zeroed memory. Some of this garbage may never be
initialized explicitly and could be interpreted in different random ways that
vary with the execution of the program. This usage of uninitialized memory may
be very hard to replicate and to debug. In some cases this uninitialized
memory may introduce subtle security bugs and leak sensitive information like
in the case of the infamous hearbleed bug in openssl. For this reason it is
recommended to do a memset after a malloc. For example:
#define BufferSize 4096
char *buffer = (char*)malloc(BufferSize);
memset(buffer, 0, BufferSize);
calloc should preferred over malloc
calloc always zeroes the allocated memory by doing an implicit memset. For this reason, calloc should be preferred to a raw malloc.
Initialization of structures
Structure should be always fully initialized. Most of the time it is
recommended to fully zero initialize structures by using the following idiom:
#include <stdlib.h> // For malloc and family
#include <string.h> // For memset
typedef struct some_struct_s
{
char firstField;
int secondField;
} some_struct_t;
// Global variables are always zero initialized.
some_struct_t globalVariable;
// In C, but not in C++, we can also initialize structures
// in the following way:
some_struct_t globalVariable2 = {
.firstField = 'A',
.secondField = 42
};
void someFunction()
{
// ...
// Local structure variable definition.
some_struct_t localVariable = {};
// DO NOT USE the following:
//struct some_struct localVariable = {0};
// Some compiler may only initialize the first field with this variant.
// Use memset with malloc
some_struct_t *pointer = (some_struct_t*)malloc(sizeof(some_struct_t));
memset(pointer, 0, sizeof(some_struct_t));
// Or use calloc
some_struct_t *otherPointer = (some_struct_t*)calloc(1, sizeof(some_struct_t));
// ...
}
String handling safeness
The usage of null terminated string literals in C is one of its greatest
design errors. This design has performance problems. Another problem with
string handling is that the string functions provided by the standard C
library are prone to produce buffer overflows (e.g: strcpy, strcat).
Common operations on strings require knowing the length of a
string for allocating memory to hold the result, which requires doing a
linear iteration per string just for asking for its size. The correct way of
fixing this is to prefix a string with its size (a la Pascal), but this
introduces a strong penalty on the code. Instead we are going to accept this
performance penalty, but we still need to use functions that operates on
strings and take care doing a correct allocation of the result buffer, and
provide safeness, security and reliability correctness when manipulating
string.
The basic operations that should be provided by this library are the
followings:
- String duplication.
- String concatenation.
- String formatting (wrapper on snprintf, but properly allocating the result buffer).
String functions from the standard library that do not receive the size of the
target should not be used (e.g: strcpy, strcat, sprintf). Instead versions, of
these functions that receive the size of the target buffer must be used
(strncat, snprintf). In the case of strncpy, it must be avoided because it
is designed for fixed sized strings without a null terminator, and strncpy
may not insert the required null terminator at the end location.
Pointers to constants
Functions that receive string arguments that are not mutated must have
pointer to constant values type. These types can be written in the following two equivalent ways:
void functionReceivingStringLiteral(const char *string);
void functionReceivingStringLiteral(char const *string);
For reasons of consistency, only one of these two ways of writing this type
must be used. The first way of defining this type is the most commonly used
way, but the second way may be easier to understand, and some files are
currently using it.
In C++ unlike C, there is no implicit cast from string literals to char*
,
so for example the following code is legal in C, but not in C++:
int executeImage(char *imageFileName);
int main()
{
return executeImage("Pharo.image");
}
Using const as much as possible it is recommended because allows using the
compiler to detect some errors. In the case of functions that belong to public
interface that is exposed to plugins or applications that are embedding the
VM, these headers should be usable from C++ code.
Operating system naming convention
The following naming prefixes should be used when writing operating system
specific code:
- For generic Unixes (Linux, FreeBSD, and to some extent OS X), they should be referred as Unix or abbreviated as unix. When using specific APIs from an Unix such as epoll in Linux vs kqueue in BSDs (including OS X), Unix specific variant naming and abbreviations should be used.
- Generic BSDs should be abbreviated as just bsd.
- OS X specific code should use osx. (Maybe we should use macos?)
- For Windows, windows in full, or win32, or w32 should be used. In some communities, using win for Windows is advised against (GNU coding standard: https://www.gnu.org/prep/standards/html_node/System-Portability.html#System-Portability https://www.gnu.org/prep/standards/html_node/Trademarks.html#Trademarks ). Even on 64 bits mode the API is still called win32.
- For the Windows App Store, the API is called UWP (Universal Windows Platform). This API is completely different from the Win32 API (except for a supported subset of Win), and it is only necessary to support it for some ARM devices, the Hololens and the Xbox One. On desktop the Win32 API is preferred.
C programming language standards
For reference, here are the draft versions of the current standards of the
C language:
These language standards should be used mostly as guidelines for
separating platforms independent features from platform specific features. GCC
and clang tend to follow the language standard closely, but MSVC tend to only
implement the standard partially. C99 features tend to be more widely
supported. From C11, the support for atomic operation seems to be the most
relevant feature for the VM.