Undefined Behavior Categorization
  2022-12-26

P1705R1 [1] makes a comprehensive list of the undefined behaviors present in the standard. my goal is to categorize such a list for C89. by doing so i hope to provide a more accessible view of what types of ub exist in the standard and an in-depth view of what each ub means. it seems that P1705R1 breaks down ub into the following categories:

* lex
* basic
* expr
* stmt
* dcl
* class

again, this list is for c++, so for c the list will be different, but i hope that i can port as much material from this list to c.
c89 can be found here [2]. they dump all the undefined behaviors in A.6.2. they don't give examples for the specific ubs and they don't give a categorization for classes of undefined behaviors. this is my task today.
following is the list of categorized undefined behaviors. the list is still in a primordial state, expect refinements in the future. up until now we have the following categories for c89 undefined behaviors:

* translation phase (7)
* identifiers (7)
* storage (18)
* functions (8)
* pointers (7)
* preprocessing (13)
* std library (11)
* signals (2)
* variable arguments (7)
* streams (2)
* print/scan (11)
* memory alloc (2)

translation phase:

1 A nonempty source file does not end in a new-line character, ends in new-line character immediately preceded by a backslash character, or ends in a partial preprocessing token or comment ($2.1.1.2).
2 A character not in the required character set is encountered in a source file, except in a preprocessing token that is never converted to a token, a character constant, a string literal, or a comment ($2.2.1).
3 A comment, string literal, character constant, or header name contains an invalid multibyte character or does not begin and end in the initial shift state ($2.2.1.2).
4 An unmatched ' or " character is encountered on a logical source line during tokenization ($3.1).
5 An unspecified escape sequence is encountered in a character constant or a string literal ($3.1.3.4).
6 A character string literal token is adjacent to a wide string literal token ($3.1.4).
7 The characters ', \, ", or /* are encountered between the < and > delimiters or the characters ', \, or /* are encountered between the " delimiters in the two forms of a header name preprocessing token ($3.1.7).

identifiers:

1 The same identifier is used more than once as a label in the same function ($3.1.2.1).
2 An identifier is used that is not visible in the current scope ($3.1.2.1).
3 Identifiers that are intended to denote the same entity differ in a character beyond the minimal significant characters ($3.1.2).
4 The same identifier has both internal and external linkage in the same translation unit ($3.1.2.2).
5 An identifier with external linkage is used but there does not exist exactly one external definition in the program for the identifier ($3.1.2.2).
6 An identifier for an object is declared with no linkage and the type of the object is incomplete after its declarator, or after its init-declarator if it has an initializer ($3.5).
7 An identifier for an object with internal linkage and an incomplete type is declared with a tentative definition ($3.7.2).

storage:

* The value stored in a pointer that referred to an object with automatic storage duration is used ($3.1.2.4).
* Two declarations of the same object or function specify types that are not compatible ($3.1.2.6).
* An attempt is made to modify a string literal of either form ($3.1.4).
* An arithmetic conversion produces a result that cannot be represented in the space provided ($3.2.1).
* An lvalue with an incomplete type is used in a context that requires the value of the designated object ($3.2.2.1).
* The value of a void expression is used or an implicit conversion (except to void) is applied to a void expression ($3.2.2.2).
?* An object is modified more than once, or is modified and accessed other than to determine the new value, between two sequence points ($3.3).
* An arithmetic operation is invalid (such as division or modulus by 0) or produces a result that cannot be represented in the space provided (such as overflow or underflow) ($3.3).
* An object has its stored value accessed by an lvalue that does not have one of the following types: the declared type of the object, a qualified version of the declared type of the object, the signed or unsigned type corresponding to the declared type of the object, the signed or unsigned type corresponding to a qualified version of the declared type of the object, an aggregate or union type that (recursively) includes one of the aforementioned types among its members, or a character type ($3.3).
* An object is assigned to an overlapping object ($3.3.16.1).
* A bit-field is declared with a type other than int, signed int, or unsigned int ($3.5.2.1).
* An expression is shifted by a negative number or by an amount greater than or equal to the width in bits of the expression being shifted ($3.3.7).
* An attempt is made to modify an object with const-qualified type by means of an lvalue with non-const-qualified type ($3.5.3).
* An attempt is made to refer to an object with volatile-qualified type by means of an lvalue with non-volatile-qualified type ($3.5.3).
* The value of an uninitialized object that has automatic storage duration is used before a value is assigned ($3.5.7).
* An object with aggregate or union type with static storage duration has a non-brace-enclosed initializer, or an object with aggregate or union type with automatic storage duration has either a single expression initializer with a type other than that of the object or a non-brace-enclosed initializer ($3.5.7).
* An attempt is made to copy an object to an overlapping object by use of a library function other than memmove ($4.).
* The result of an integer arithmetic function (abs, div, labs, or ldiv) cannot be represented ($4.10.6).

functions:

* An argument to a function is a void expression ($3.3.2.2).
* For a function call without a function prototype, the number of arguments does not agree with the number of parameters ($3.3.2.2).
* For a function call without a function prototype, if the function is defined without a function prototype, and the types of the arguments after promotion do not agree with those of the parameters after promotion ($3.3.2.2).
* If a function is called with a function prototype and the function is not defined with a compatible type ($3.3.2.2).
* A function that accepts a variable number of arguments is called without a function prototype that ends with an ellipsis ($3.3.2.2).
* A function is declared at block scope with a storage-class specifier other than extern ($3.5.1).
* The value of a function is used, but no value was returned ($3.6.6.4).
* A function that accepts a variable number of arguments is defined without a parameter type list that ends with the ellipsis notation ($3.7.1).

pointers:

* An invalid array reference, null pointer reference, or reference to an object declared with automatic storage duration in a terminated block occurs ($3.3.3.2).
* A pointer to a function is converted to point to a function of a different type and used to call a function of a type not compatible with the original type ($3.3.4).
* A pointer to a function is converted to a pointer to an object or a pointer to an object is converted to a pointer to a function ($3.3.4).
* A pointer is converted to other than an integral or pointer type ($3.3.4).
* A pointer that is not to a member of an array object is added to or subtracted from ($3.3.6).
* Pointers that are not to the same array object are subtracted ($3.3.6).
* Pointers are compared using a relational operator that do not point to the same aggregate or union ($3.3.8).

preprocessing:

* The token defined is generated during the expansion of a #if or #elif preprocessing directive ($3.8.1).
* The #include preprocessing directive that results after expansion does not match one of the two header name forms ($3.8.2).
* A macro argument consists of no preprocessing tokens ($3.8.3).
* There are sequences of preprocessing tokens within the list of macro arguments that would otherwise act as preprocessing directive lines ($3.8.3).
* The result of the preprocessing concatenation operator ## is not a valid preprocessing token ($3.8.3).
* The #line preprocessing directive that results after expansion does not match one of the two well-defined forms ($3.8.4).
* One of the following identifiers is the subject of a #define or #undef preprocessing directive: defined, __LINE__, __FILE__, __DATE__, __TIME__, or __STDC__ ($3.8.8).
* The effect if the program redefines a reserved external identifier ($4.1.2).
* The effect if a standard header is included within an external definition; is included for the first time after the first reference to any of the functions or objects it declares, or to any of the types or macros it defines; or is included while a macro is defined with a name the same as a keyword ($4.1.2).
* A macro definition of errno is suppressed to obtain access to an actual object ($4.1.3).
* The parameter member-designator of an offsetof macro is an invalid right operand of the . operator for the type parameter or designates bit-field member of a structure ($4.1.5).
* The macro definition of assert is suppressed to obtain access to an actual function ($4.2).
* A macro definition of setjmp is suppressed to obtain access to an actual function ($4.6).

std library:

* A library function argument has an invalid value, unless the behavior is specified explicitly ($4.1.6).
* A library function that accepts a variable number of arguments is not declared ($4.1.6).
* The argument to a character handling function is out of the domain ($4.3).
* An invocation of the setjmp macro occurs in a context other than as the controlling expression in a selection or iteration statement, or in a comparison with an integral constant expression (possibly as implied by the unary ! operator) as the controlling expression of a selection or iteration statement, or as an expression statement (possibly cast to void) ($4.6.1.1).
* An object of automatic storage class that does not have volatile-qualified type has been changed between a setjmp invocation and a longjmp call and then has its value accessed ($4.6.2.1).
* The longjmp function is invoked from a nested signal routine ($4.6.2.1).
* The result of converting a string to a number by the atof, atoi, or atol function cannot be represented ($4.10.1).
* A program executes more than one call to the exit function ($4.10.4.3).
* The shift states for the mblen, mbtowc, and wctomb functions are not explicitly reset to the initial state when the LC_CTYPE category of the current locale is changed ($4.10.7).
* An array written to by a copying or concatenation function is too small ($4.11.2, $4.11.3).
* An invalid conversion specification is found in the format for the strftime function ($4.12.3.5).

signals:

* A signal occurs other than as the result of calling the abort or raise function, and the signal handler calls any function in the standard library other than the signal function itself or refers to any object with static storage duration other than by assigning a value to a static storage duration variable of type volatile sig_atomic_t ($4.7.1.1).
* The value of errno is referred to after a signal occurs other than as the result of calling the abort or raise function and the corresponding signal handler calls the signal function such that it returns the value SIG_ERR ($4.7.1.1).

variable arguments:

* The macro va_arg is invoked with the parameter ap that was passed to a function that invoked the macro va_arg with the same parameter ($4.8).
* A macro definition of va_start, va_arg, or va_end or a combination thereof is suppressed to obtain access to an actual function ($4.8.1).
* The parameter parmN of a va_start macro is declared with the register storage class, or with a function or array type, or with a type that is not compatible with the type that results after application of the default argument promotions ($4.8.1.1).
* There is no actual next argument for a va_arg macro invocation ($4.8.1.2).
* The type of the actual next argument in a variable argument list disagrees with the type specified by the va_arg macro ($4.8.1.2).
* The va_end macro is invoked without a corresponding invocation of the va_start macro ($4.8.1.3).
* A return occurs from a function with a variable argument list initialized by the va_start macro before the va_end macro is invoked ($4.8.1.3).

streams:

* The stream for the fflush function points to an input stream or to an update stream in which the most recent operation was input ($4.9.5.2).
* An output operation on an update stream is followed by an input operation without an intervening call to the fflush function or a file positioning function, or an input operation on an update stream is followed by an output operation without an intervening call to a file positioning function ($4.9.5.3).

print/scan:

* The format for the fprintf or fscanf function does not match the argument list ($4.9.6).
* An invalid conversion specification is found in the format for the fprintf or fscanf function ($4.9.6).
* A %% conversion specification for the fprintf or fscanf function contains characters between the pair of % characters ($4.9.6).
* A conversion specification for the fprintf function contains an h or l with a conversion specifier other than d, i, n, o, u, x, or X, or an L with a conversion specifier other than e, E, f, g, or G ($4.9.6.1).
* A conversion specification for the fprintf function contains a # flag with a conversion specifier other than o, x, X, e, E, f, g, or G ($4.9.6.1).
* A conversion specification for the fprintf function contains a 0 flag with a conversion specifier other than d, i, o, u, x, X, e, E, f, g, or G ($4.9.6.1).
* An aggregate or union, or a pointer to an aggregate or union is an argument to the fprintf function, except for the conversion specifiers %s (for an array of character type) or %p (for a pointer to void) ($4.9.6.1).
* A single conversion by the fprintf function produces more than 509 characters of output ($4.9.6.1).
* A conversion specification for the fscanf function contains an h or l with a conversion specifier other than d, i, n, o, u, or x, or an L with a conversion specifier other than e, f, or g ($4.9.6.2).
* A pointer value printed by %p conversion by the fprintf function during a previous program execution is the argument for %p conversion by the fscanf function ($4.9.6.2).
* The result of a conversion by the fscanf function cannot be represented in the space provided, or the receiving object does not have an appropriate type ($4.9.6.2).

memory alloc:

* The value of a pointer that refers to space deallocated by a call to the free or realloc function is referred to ($4.10.3).
* The pointer argument to the free or realloc function does not match a pointer earlier returned by calloc, malloc, or realloc, or the object pointed to has been deallocated by a call to free or realloc ($4.10.3).
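
to make a few of the storage entries concrete, here is a minimal c89 sketch of guards for three of them: an arithmetic result that cannot be represented, division by 0 ($3.3), and a shift by a negative or too-large amount ($3.3.7). the helper names checked_add, checked_div and checked_shl are hypothetical, not part of any library, and the shift guard assumes that unsigned int has no padding bits, so that its width is sizeof(unsigned int) * CHAR_BIT.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical helpers, not part of any standard API.
   Each returns 1 and stores the result in *out when the operation is
   well defined, and returns 0 (leaving *out untouched) otherwise. */

/* Signed addition: compare against INT_MAX / INT_MIN before adding,
   because the overflowing addition itself would already be undefined. */
int checked_add(int a, int b, int *out)
{
    assert(out != NULL);
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b))
        return 0;
    *out = a + b;
    return 1;
}

/* Division: reject a zero divisor, and reject a / -1 when -a is not
   representable (INT_MIN / -1 on two's complement machines). */
int checked_div(int a, int b, int *out)
{
    assert(out != NULL);
    if (b == 0)
        return 0;
    if (b == -1 && a < -INT_MAX)
        return 0;
    *out = a / b;
    return 1;
}

/* Left shift: the count must be non-negative and smaller than the
   width of the shifted type (assumed: no padding bits). */
int checked_shl(unsigned int x, int n, unsigned int *out)
{
    assert(out != NULL);
    if (n < 0 || n >= (int)(sizeof x * CHAR_BIT))
        return 0;
    *out = x << n;
    return 1;
}
```

each helper returns 0 instead of performing the operation when the standard would leave the result undefined, so the caller has to handle that case explicitly.
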
it's not weird that most undefined behaviors fall into the storage category. at the end of the day this is what makes the difference between hardware architectures. next, there is a surprising number of preprocessing undefined behaviors. why they decided to add so many ubs in this category is a question i cannot answer for the moment. luckily these ubs cannot be used for compiler optimizations, so even if they are many, they are useless for my goal. next we have the standard library undefined behaviors. you somewhat expect a large number of ubs here: we are talking about functions that are not fed the right arguments, and you expect ub if you do not use the functions in the manner designated by the standard. another surprise is the large number of print/scan ubs. again, i don't care about them that much because they do not affect optimizations that much. or do they? are there cases where printf semantics were changed in the presence of ub optimizations? who knows? next we have functions, translation phase, identifiers, pointers and variable arguments. they might change the semantics of the code in unexpected ways. here i need to spend some more time to discover the secrets of these undefined behaviors. finally we have signals, streams and memory alloc. we all know how dangerous memory alloc is, even if the number of ubs is so small. for signals i also expect interesting things. streams, on the other hand, do not seem that special in this context.
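
as a partial answer to the printf question, one commonly reported case, assuming gcc together with glibc: gcc may compile printf("%s\n", s) into a call to puts(s). passing a null pointer for %s is undefined (an invalid library function argument, $4.1.6). glibc's printf happens to print "(null)" for it, while the puts call usually crashes, so what the program does can depend on whether the compiler performs that rewrite. the sketch below sidesteps the question by never letting a null pointer reach %s. print_line and its fallback parameter are hypothetical names, not anything from the standard library.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper, not a standard function.
   Prints s followed by a new-line and returns 1. If s is a null
   pointer, prints the caller-supplied fallback text instead and
   returns 0, so a null pointer is never passed to the %s conversion. */
int print_line(const char *s, const char *fallback)
{
    assert(fallback != NULL);
    if (s == NULL) {
        puts(fallback);
        return 0;
    }
    /* The compiler is free to emit puts(s) for this call; with s known
       to be non-null, both forms behave the same. */
    printf("%s\n", s);
    return 1;
}
```

with the null check in front, the behavior no longer depends on how the compiler chooses to implement the call.
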
my initial intent was to analyze each group of undefined behaviors and understand better whether each ub can influence the compiler when it performs optimizations based on ub. but i won't do that in this article because it would get too long and i'm not a fan of this kind of writing, especially in the current format of the blog where it's hard to move from one part of the article to another. so that's it for today.
note: it seems that other people have done this categorization before me. i should take a look in the future at [3,4] because they have different categorizations than mine.

[1] https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1705r1.html
[2] https://port70.net/~nsz/c/c89/c89-draft.txt
[3] https://www.cs.ru.nl/bachelors-theses/2017/Matthias_Vogelaar___4372913___Defining_the_Undefined_in_C.pdf
[4] https://solidsands.com/wp-content/uploads/Master_Thesis_Vasileios_GemistosFinal.pdf