cc [ flags ] -I/usr/local/include file(s) -L/usr/local/lib -lmcw [ ... ]

#include <stdarg.h> (required only for the functions with va_list arguments)
#include <stdio.h>

extern int fscanf (FILE * restrict stream, const char * restrict format, ...);
extern int scanf (const char * restrict format, ...);
extern int sscanf (const char * restrict s, const char * restrict format, ...);
extern int vfscanf (FILE * restrict stream, const char * restrict format, va_list arg);
extern int vscanf (const char * restrict format, va_list arg);
extern int vsscanf (const char * restrict s, const char * restrict format, va_list arg);
Using the scanf() and vscanf() functions is equivalent to supplying stdin as the first argument to fscanf() and vfscanf() respectively. The mathcw library implementation of these functions fully conforms to the 1989 and 1999 ISO Standards for the C programming language, but is available even with older pre-C99 compilers, and also offers several useful extensions:
- support for decimal floating-point datatypes;
- @ format conversion in bases 2 ... 36 for floating-point numbers;
- b and B binary floating-point format conversions;
- q and Q octal floating-point format conversions;
- y and Y binary integer format conversions;
- digit grouping.
A format string may consist of ordinary characters, which normally must match verbatim, and format specifications, which usually guide the conversion or use of the next argument.
The exception to verbatim matching of ordinary characters is whitespace: one or more consecutive whitespace characters (as determined by isspace(3)) in the format string match zero or more consecutive whitespace characters in the input. In particular, this convention makes it possible to introduce whitespace between format specifications to make format strings more readable.
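As a minimal sketch of the whitespace rule, the following uses only standard sscanf() behavior, which the mathcw implementation is stated to follow. The function name parse_pair is hypothetical and chosen for illustration.

```c
#include <stdio.h>

/* Hypothetical helper: parse "x , y".  The spaces in the format match
   zero or more whitespace characters in the input, while the comma is
   an ordinary character that must match verbatim. */
int parse_pair(const char *text, int *x, int *y)
{
    return sscanf(text, "%d , %d", x, y);
}
```

With this format, the inputs "1,2" and "  3   ,\n\t4" both yield two conversions, while "5 ; 6" stops after the first conversion because the semicolon does not match the comma.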
Successive specifications consume additional arguments, and it is the caller's responsibility to ensure that the counts and datatypes of arguments and specifications match.
All arguments after the format string must be pointers, since they represent locations into which data are written. If there are too few arguments, or if they are of the wrong type, the user program may fail because of addressing errors, or produce erroneous conversions. Excess arguments are silently ignored.
A specification begins with a percent (%) character, and is followed by optional flags, an optional unsigned number, an optional datatype specifier, and finally, a single-character conversion specifier.
The format specification follows one of these templates:
%{flags}{datatype}[AaBbcdEeFfGginopQqsuXx%@] %{flags}w{datatype}[AaBbcdEeFfGginopQqsuXx%@]
Braces indicate a string of zero or more characters, and the braces are not included in the specification. Brackets indicate a set of characters, exactly one of which must be chosen.
Only a single flag character is recognized by the scanf() family:
- *
- Convert an input value, but suppress its assignment to an argument, and do not count it in the function return value.
Flag repetitions are permitted, but carry no additional meaning.
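The effect of the * flag on the return value can be sketched with standard sscanf(); the helper name parse_first_and_third is an assumption made for this example.

```c
#include <stdio.h>

/* Hypothetical helper: read three integers but keep only the first and
   third.  The middle value is converted and discarded by %*d, so it has
   no matching argument and is not counted in the return value. */
int parse_first_and_third(const char *text, int *first, int *third)
{
    return sscanf(text, "%d %*d %d", first, third);
}
```

For the input "10 20 30" the call returns 2, not 3, because the suppressed conversion is not counted.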
The w field is the maximum input field width. An omitted width value means that there is no limit on the size of the input field.
Leading zeros in the width are interpreted as decimal digits; they do not start an octal integer as they do elsewhere in the C-language family.
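A maximum field width lets adjacent fields be split without separators. This sketch relies only on standard sscanf() behavior; the helper name split_date is hypothetical.

```c
#include <stdio.h>

/* Hypothetical helper: split an 8-digit date such as "20240131" into
   year, month, and day.  Each width caps the number of input characters
   consumed by that conversion. */
int split_date(const char *text, int *year, int *month, int *day)
{
    return sscanf(text, "%4d%2d%2d", year, month, day);
}
```

The field "01" is read by %2d as the decimal value 1: the width limits how much input is consumed, and %d always converts in base 10.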
Adjacent digits in input numeric values in any supported number base may be separated by a single underscore, which does not affect the value. Thus, the input strings 3.141592653589793 and 3.141_592_653_589_793 specify the same approximation to pi.
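Digit grouping is a mathcw extension, so a standard C library will not accept underscores. For code that must also run without mathcw, the convention can be approximated by preprocessing the input. The following is only a sketch under that assumption: the helper parse_grouped_double is hypothetical, handles decimal digits only, and is not how the mathcw library itself is implemented.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: drop each single underscore that sits between two
   decimal digits, then convert the result with standard strtod().
   Returns 0 on success, or -1 if the text is too long or is not
   completely converted. */
int parse_grouped_double(const char *src, double *result)
{
    char buf[128];
    size_t n = 0;

    for (size_t i = 0; src[i] != '\0'; ++i) {
        int is_separator = src[i] == '_' && i > 0
            && isdigit((unsigned char)src[i - 1])
            && isdigit((unsigned char)src[i + 1]);

        if (is_separator)
            continue;               /* skip the grouping underscore */
        if (n + 1 >= sizeof buf)
            return -1;              /* input too long for the buffer */
        buf[n++] = src[i];
    }
    buf[n] = '\0';

    char *end;
    *result = strtod(buf, &end);
    return (end != buf && *end == '\0') ? 0 : -1;
}
```

Because only a single underscore between two digits is removed, a doubled underscore as in "1__0" is left in place and the conversion is rejected.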
For based-number conversion with %@, a base out of the range 2 ... 36 is an error. Bases larger than 10 use successive letters of the modern English alphabet, just as hexadecimal notation uses the additional letters a ... f, and lettercase is ignored.
The exponent of based numbers is always a power of the base. By contrast, for binary, octal, and hexadecimal floating-point formats, the exponent is a power of two. Thus, the decimal value 255.0 can be written equivalently as 0x1.fep+7, 0xffp+0, 0o1.774p+7, 0b1.1111111p+7, and in based-number form as 16@f.f@e+1, 16@ff@e+0, 8@3.77@e+2, 2@1.1111111@e+7, and so on. While C89 supports only the decimal form, C99 also allows the hexadecimal form. However, the hoc(1) language recognizes all of them.
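The equivalences above can be checked in part with a standard C99 library, which accepts the decimal and hexadecimal forms; the binary, octal, and based forms need the mathcw extensions. In this sketch, read_double and based_value are hypothetical helpers, and based_value simply redoes the arithmetic for a based number with a nonnegative exponent.

```c
#include <stdio.h>

/* Hypothetical helper: convert text with standard %lf, which in C99
   accepts decimal and hexadecimal floating-point input.
   Returns -1.0 if nothing could be converted. */
double read_double(const char *text)
{
    double value = 0.0;
    return sscanf(text, "%lf", &value) == 1 ? value : -1.0;
}

/* Hypothetical helper: value of significand * base**exponent, for a
   nonnegative exponent, mirroring the rule that the exponent of a based
   number is a power of the base. */
double based_value(double significand, int base, int exponent)
{
    double value = significand;
    for (int i = 0; i < exponent; ++i)
        value *= base;
    return value;
}
```

For instance, 16@f.f@e+1 has the hexadecimal significand f.f, which is 15.9375, so based_value(15.9375, 16, 1) gives 255; likewise 8@3.77@e+2 has the octal significand 3.77, which is 3.984375, and 3.984375 times 64 is again 255.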
The datatypes for integer argument conversion are:
- hh
- signed or unsigned char [C99 extension];
- h
- signed or unsigned short int;
- l (ell)
- signed or unsigned long int;
- ll
- signed or unsigned long long int [C99 extension];
- j
- intmax_t or uintmax_t [C99 extension];
- z
- size_t [C99 extension];
- t
- ptrdiff_t [C99 extension].
The datatype specifiers are mandatory when arguments do not have default types (int, double, and char *), since they determine the amount of data stored via the pointer arguments.
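The following sketch pairs each integer datatype specifier with a matching pointer argument, using only C99 features. The struct sizes type and the parse_sizes function are hypothetical names for illustration.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical record with one member per non-default integer type. */
struct sizes {
    short       s;
    long        l;
    long long   ll;
    size_t      z;
    signed char c;
};

/* Each specifier tells the conversion how many bytes to store through
   the corresponding pointer: h for short, l for long, ll for long long,
   z for size_t, and hh for char-sized integers. */
int parse_sizes(const char *text, struct sizes *out)
{
    return sscanf(text, "%hd %ld %lld %zu %hhd",
                  &out->s, &out->l, &out->ll, &out->z, &out->c);
}
```

Using plain %d for any of these members would store an int-sized value through a pointer to an object of a different size, which is the mismatch the paragraph above warns about.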
The datatypes for binary floating-point argument conversion are:
- l
- double;
- L
- long double;
- LL
- long_long_double [mathcw extension].
The datatypes for decimal floating-point argument conversion are:
- H
- decimal_float [N1176 extension];
- DD
- decimal_double [N1176 extension];
- DL
- decimal_long_double [N1176 extension];
- DLL
- decimal_long_long_double [mathcw extension].
The H, DD, and DL datatypes follow the proposal ISO/IEC JTC1 SC22 WG14 N1176 Extension for the programming language C to support decimal floating-point arithmetic.
Integer conversions behave like those performed by the strtol(3) and strtoul(3) function family. The conversion types for integers are:
- d
- optionally-signed decimal;
- i
- optionally-signed decimal, or a number base determined by the input representation:
- leading 0b or 0B for binary [mathcw extension];
- leading 0 for octal;
- leading 0x or 0X for hexadecimal.
- n
- write the current input character count into the next argument, which must be of type int *, and is not counted in the function return value;
- o
- optionally-signed octal;
- u
- optionally-signed decimal;
- x
- optionally-signed hexadecimal;
- X
- same as x conversion;
- y
- optionally-signed binary [mathcw extension];
- Y
- same as y conversion [mathcw extension].
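A short sketch of the i and n conversions, restricted to the standard decimal, octal, and hexadecimal prefixes (the 0b prefix and the y conversions are mathcw extensions and are not exercised here). The helper name parse_auto_base is hypothetical.

```c
#include <stdio.h>

/* Hypothetical helper: convert an integer whose base is chosen by its
   prefix, and report how many input characters were consumed.  The %n
   conversion stores the count but does not add to the return value. */
int parse_auto_base(const char *text, int *value, int *consumed)
{
    *consumed = 0;
    return sscanf(text, "%i%n", value, consumed);
}
```

The input "0x1f rest" gives the value 31 with 4 characters consumed, and "017" gives 15 with 3 characters consumed; in both cases the return value is 1.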
The conversion type for pointers is:
- p
- void *. Standard C defines the output form to be implementation dependent. In the mathcw library, %#x-style is used for modern 32-bit and 64-bit systems, while for the PDP-10, it follows tradition with %06o,,%06o, and on the PDP-11, with %0o. Platform-specific conventions may be provided for other systems as well.
If the input item for %p conversion was produced as output by the printf(3CW) family earlier in the same program execution, it refers to the same value. Otherwise, its meaning is undefined.
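The round-trip guarantee can be sketched with the standard snprintf() and sscanf() functions; the helper name pointer_round_trips is an assumption for this example.

```c
#include <stdio.h>

/* Hypothetical helper: print a pointer with %p, read it back with %p in
   the same program execution, and report whether the same value was
   recovered.  Returns 1 on success, 0 otherwise. */
int pointer_round_trips(void *original)
{
    char text[64];
    void *recovered = NULL;

    snprintf(text, sizeof text, "%p", original);
    if (sscanf(text, "%p", &recovered) != 1)
        return 0;
    return recovered == original;
}
```

The textual form stored in the buffer is implementation dependent, so the sketch never inspects it; only the recovered pointer value is compared.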
The conversion types for strings are:
- %
- Literal percent character. This is the only conversion type for which no argument can be consumed. In C89, flags, field widths, and datatype are normally omitted, so the format specification is usually written as %%. C99 permits only that form.
- c
- unsigned char, or with the l (ell) modifier, unsigned wchar_t. Without skipping leading whitespace, match exactly as many characters as the field width, or one character if a field width is not specified, and if the assignment-suppression flag is absent, store them in the area pointed to by the current argument. Do not append a trailing NUL.
- s
- signed or unsigned char *, or with the l (ell) modifier, wchar_t *. Any leading whitespace is skipped, and then the next consecutive non-whitespace characters are collected, a trailing NUL is appended, and if the assignment-suppression flag is absent, stored in the area pointed to by the current argument.
A maximum field width should always be specified to limit the number of characters stored in the area pointed to by the current argument.
- [...]
- signed or unsigned char *, or with the l (ell) modifier, wchar_t *. Without skipping leading whitespace, match the longest nonempty sequence of characters belonging to the specified scanset (the characters between the brackets in the format specification), and if the assignment-suppression flag is absent, store them, and a trailing NUL, in the storage area pointed to by the current argument. The number of input characters converted is at most the maximum field width, which does not count the extra space for the trailing NUL.
A maximum field width should always be specified to limit the number of characters stored in the area pointed to by the current argument.
The scanset is a sequence of one or more characters. If the initial character after the left bracket is a caret (^), then the scanset is the complement of the specified sequence, that is, all characters except those specified. If the first character following the left bracket, or left bracket and caret, is a right bracket, it is part of the set, rather than ending the scanset.
Except for special handling at the beginning of the scanset, no significance is attached to characters in the scanset, or to their order, or their repetition; in particular, hyphens in the scanset do not imply character ranges, as they do in bracketed regular-expression patterns.
For example, %6[01234567] matches a sequence of one to six octal digits, %[^aeiouAEIOU] matches a sequence of one or more characters that are not English vowels, %[AAA] matches a sequence of one or more A characters, %[][0123456789] matches a sequence of one or more decimal digits and brackets, and %[]] matches a sequence of one or more right brackets.
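Three of the scansets above can be exercised with standard sscanf(), as in this sketch. The function names are hypothetical, and each buffer is one character larger than the field width to leave room for the trailing NUL.

```c
#include <stdio.h>

/* Up to six octal digits; out needs 6 + 1 bytes. */
int match_octal6(const char *text, char out[7])
{
    return sscanf(text, "%6[01234567]", out);
}

/* Up to fifteen characters that are not English vowels.  Whitespace is
   not skipped and is itself a member of the complemented set. */
int match_non_vowels(const char *text, char out[16])
{
    return sscanf(text, "%15[^aeiouAEIOU]", out);
}

/* Up to seven right brackets: the ] immediately after the opening
   bracket is a member of the set rather than its end. */
int match_right_brackets(const char *text, char out[8])
{
    return sscanf(text, "%7[]]", out);
}
```

Each function returns 1 when at least one character matched, and 0 when the first input character is not in the scanset, since the matched sequence must be nonempty.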
Floating-point conversions behave like those performed by the strtod(3) function family, although the mathcw library recognizes additional formats. The conversion types for floating-point values are:
- a
- optionally-signed floating-point number in any of these forms:
- decimal (-d.ddd...e+nn or -d.ddd...e+NN or d.ddd...);
- hexadecimal (-0xh.hhh...p+nn or -0Xh.hhh...P+nn) [C99 extension];
- binary (-0bh.hhh...p+nn or -0Bh.hhh...P+nn) [mathcw extension];
- octal (-0od.ddd...p+nn or -0Od.ddd...P+nn) [mathcw extension];
- based (-nn@d.ddd...@e+nn or -nn@d.ddd...@E+nn) [mathcw extension].
In addition to numeric values, the conversion recognizes optionally-signed strings inf and infinity for IEEE 754 Infinity and nan and nan(n-char-sequence) for IEEE 754 NaN (Not-a-Number). In all of these, lettercase is ignored. The meaning of the n-char-sequence is unspecified, apart from requiring balanced parentheses. In the mathcw implementation, it is ignored, and the stored input is a quiet NaN. NaN may be signed, but its sign is platform-dependent, and without meaning in IEEE 754 floating-point arithmetic. If the underlying arithmetic system does not support Infinity and NaN, then both produce a signed magnitude with the largest representable floating-point number.
TO DO: Ensure this guarantee for non-IEEE-754 systems.
The mathcw implementation also recognizes qnan and qnan(n-char-sequence) for quiet NaNs, and snan and snan(n-char-sequence) for signaling NaNs. Note, however, that some early implementations of IEEE 754 arithmetic, such as the Intel IA-32 architecture, support only one kind of NaN; in such a case, all input NaN representations are treated as that single kind.
TO DO: Recognize QNaN and SNaN.
- A
- same as %a conversion;
- b
- same as %a conversion [mathcw extension];
- B
- same as %a conversion [mathcw extension];
- e
- same as %a conversion;
- E
- same as %a conversion;
- f
- same as %a conversion;
- F
- same as %a conversion;
- g
- same as %a conversion;
- G
- same as %a conversion;
- q
- same as %a conversion [mathcw extension];
- Q
- same as %a conversion [mathcw extension];
- @
- same as %a conversion.
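The standard subset of this behavior, namely the interchangeable a, e, f, and g conversions and the inf and nan spellings, can be sketched as follows. This assumes a C99 library on a system with IEEE 754 arithmetic; the qnan and snan spellings and the b, q, and @ conversions are mathcw extensions and are not used. Both function names are hypothetical.

```c
#include <math.h>
#include <stdio.h>

/* Hypothetical helper: classify converted input as 'i' (Infinity),
   'n' (NaN), 'f' (finite), or '?' when no conversion was possible. */
char classify_input(const char *text)
{
    double value;

    if (sscanf(text, "%lf", &value) != 1)
        return '?';
    if (isnan(value))
        return 'n';
    if (isinf(value))
        return 'i';
    return 'f';
}

/* Hypothetical helper: returns 1 if the a, e, f, and g conversions all
   accept the text and produce the same finite value. */
int same_under_a_e_f_g(const char *text)
{
    double a, e, f, g;

    if (sscanf(text, "%la", &a) != 1 || sscanf(text, "%le", &e) != 1 ||
        sscanf(text, "%lf", &f) != 1 || sscanf(text, "%lg", &g) != 1)
        return 0;
    return a == e && e == f && f == g;
}
```

Lettercase is ignored for the special values, so "-Infinity" and "NaN" are classified the same way as "-infinity" and "nan".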
Here is a list of the areas of undefined or implementation-defined behavior, in order of their appearance in Technical Corrigendum 2 of the 1999 ISO C Standard, each with a statement of how the mathcw library implementation behaves:
In the event of an error report on stderr, control immediately returns to the caller of the scanf() family routine with an error code of EOF (-1). The remainder of the format string, and all remaining arguments, are left unprocessed. However, the part of the specification before the error has already been processed, and may have consumed input and assigned values.
- If there are insufficient arguments for the format, the behavior is undefined.
Undetectable: program failure is likely.
- If this object [the one pointed to by the next unassigned argument] does not have an appropriate type, or if the result of the conversion cannot be represented in the object, the behavior is undefined.
Undetectable: program failure is likely.
- If a length modifier appears with any conversion specifier other than as specified above [in the list of length modifiers], the behavior is undefined.
Detected and reported on stderr.
- If a - character is in the scanlist [of a %[...] specification] and is not the first, nor the second where the first character is a ^, nor the last character, the behavior is implementation-defined.
Character ranges are not supported, so - is an ordinary character.
- the behavior of the %p conversion is undefined [if the value was not produced earlier in the same program execution].
Undetectable: program failure is likely if such a pointer is dereferenced.
- If the conversion specification [for %n] includes an assignment-suppressing character or a field width, the behavior is undefined.
Detected and reported on stderr.
- If a conversion specification is invalid, the behavior is undefined.
Detected and reported on stderr.
- If copying takes place between objects that overlap, the behavior is undefined.
Not detected. Program behavior is unpredictable.
- If the value of the result [of numeric conversions] cannot be represented, the behavior is undefined.
For integer conversions, overflow is undetected in the C language, and large positive values wrap to large-magnitude negative ones, and vice versa, possibly multiple times. For floating-point conversions, overflow produces an Infinity in IEEE 754 arithmetic, but the result may be unpredictable on older systems. Floating-point underflow may produce an IEEE 754 subnormal, or zero.
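Because integer overflow is not reported by the scanf() family, a program that must reject out-of-range input can read the field as text and convert it with the standard strtol() function, which does report range errors through errno. This is only a sketch of that alternative; the helper name checked_long and its return codes are hypothetical.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: convert decimal text to long with range checking.
   Returns 0 on success, -1 if no digits were found, and -2 if the value
   cannot be represented in a long. */
int checked_long(const char *text, long *out)
{
    char *end;

    errno = 0;
    long value = strtol(text, &end, 10);

    if (end == text)
        return -1;          /* nothing was converted */
    if (errno == ERANGE)
        return -2;          /* overflow or underflow detected */
    *out = value;
    return 0;
}
```

The errno variable must be cleared before the call, because strtol() sets it only on failure and a stale value would otherwise be misread as a range error.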
In the mathcw library, all functions in the scanf() family are short wrappers that call a common internal function to handle the format scanning and argument processing. They are thus guaranteed to behave identically, apart from where their input is read. This has not been true historically of other implementations on some systems, because the family members were introduced at different times, and may have different code.
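The wrapper pattern can be illustrated from the caller's side with the standard va_list interface. This is a sketch only: scan_line is a hypothetical function, and it forwards to the standard vsscanf() rather than to the library's internal routine, whose name and signature are not given here.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical variadic wrapper: collect the pointer arguments in a
   va_list and hand them to vsscanf(), which does all of the format
   scanning and argument processing. */
int scan_line(const char *line, const char *format, ...)
{
    va_list args;
    int count;

    va_start(args, format);
    count = vsscanf(line, format, args);
    va_end(args);
    return count;
}
```

Because the wrapper adds no parsing of its own, a call such as scan_line("7 2.5", "%d %lf", &n, &x) behaves exactly like the corresponding sscanf() call.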
- %c does not store a trailing NUL to terminate the string.
- Although %s and %[] do store a string-terminating NUL, unless a correct maximum field width is specified, characters can be written beyond the area pointed to by the current argument.
- The maximum field width for %s and %[] conversions does not count the extra character needed for the trailing NUL, but that subtle point is likely to be overlooked by programmers.
- Since all arguments after the format string are pointers, it is possible to store arbitrary data in those locations, and for %s conversion, unbounded amounts of data. If format strings can be altered, or provided, by the user, then a determined attacker may be able to write arbitrary data to arbitrary memory locations, causing program failure, or otherwise compromising security.
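The first three points can be made concrete with a short sketch that uses standard sscanf(). The names NAME_SIZE, read_name, and read_code are hypothetical; the key detail is that each field width is one less than the size of its buffer.

```c
#include <stdio.h>

#define NAME_SIZE 8     /* hypothetical buffer size for this example */

/* Bounded %s: the width 7 is NAME_SIZE - 1, leaving one byte for the
   trailing NUL that %s always stores. */
int read_name(const char *text, char name[NAME_SIZE])
{
    return sscanf(text, "%7s", name);
}

/* %3c stores exactly three characters, does not skip whitespace, and
   stores no NUL, so the caller must terminate the string itself. */
int read_code(const char *text, char code[4])
{
    int count = sscanf(text, "%3c", code);

    if (count == 1)
        code[3] = '\0';
    return count;
}
```

Given the input "  abcdefghijkl", read_name() skips the leading blanks and stores only "abcdefg", so the 8-byte buffer is never overrun however long the input word is.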
For both file and string input, a return of EOF can happen if an erroneous conversion specification is encountered. In such a case, an error message and the faulty specification are reported on stderr.
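Callers therefore need to distinguish three outcomes: EOF, a count smaller than expected, and full success. The sketch below shows the distinction with standard sscanf(), where EOF arises from end of input before any conversion; the extra case of EOF for a faulty specification is particular to mathcw and is not triggered here. The helper name try_read_int is hypothetical.

```c
#include <stdio.h>

/* Hypothetical helper: returns the raw sscanf() result so the caller can
   tell apart EOF (no input available), 0 (input present but not a valid
   integer), and 1 (one value stored). */
int try_read_int(const char *text, int *value)
{
    return sscanf(text, "%d", value);
}
```

Testing the result only for nonzero would treat EOF as success, since EOF is a negative value; comparing against the expected number of conversions avoids that mistake.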