SCANF 3CW "29 January 2007" "mathcw-1.00"

NAME

fscanf, scanf, sscanf, vfscanf, vscanf, vsscanf - formatted-input routines

SYNOPSIS

cc [ flags ] -I/usr/local/include file(s) -L/usr/local/lib -lmcw [ ... ]

#include <stdarg.h> (required only for the functions with va_list arguments)

#include <stdio.h>

extern int fscanf (FILE * restrict stream, const char * restrict format, ...);
extern int scanf (const char * restrict format, ...);
extern int sscanf (const char * restrict s, const char * restrict format, ...);
extern int vfscanf (FILE * restrict stream, const char * restrict format, va_list arg);
extern int vscanf (const char * restrict format, va_list arg);
extern int vsscanf (const char * restrict s, const char * restrict format, va_list arg);

DESCRIPTION

Functions in the scanf() family convert human-readable text representations of numerical, character, and pointer data to internal ones according to specifications in the format-string argument.

Using the scanf() and vscanf() functions is equivalent to supplying stdin as the first argument to fscanf() and vfscanf() respectively. The mathcw library implementation of these functions fully conforms to the 1989 and 1999 ISO Standards for the C programming language, but is available even with older pre-C99 compilers. It also offers several useful extensions, which are marked [mathcw extension] in the descriptions below.

A format string may consist of ordinary characters, which normally must match verbatim, and format specifications, which usually guide the conversion or use of the next argument.

The exception to verbatim matching of ordinary characters is whitespace: one or more consecutive whitespace characters (as determined by isspace(3)) in the format string match zero or more consecutive whitespace characters in the input. In particular, this convention makes it possible to introduce whitespace between format specifications to make format strings more readable.
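The whitespace rule can be illustrated with a minimal sketch that uses only Standard C sscanf(). The function name parse_pair is hypothetical and chosen for this example only.

```c
#include <stdio.h>

/* Parse two integers separated by a comma.  The blanks around the comma
   in the format match any amount of input whitespace, including none,
   while the comma itself must match verbatim. */
int parse_pair(const char *text, int *a, int *b)
{
    return sscanf(text, "%d , %d", a, b);
}
```

With this format, the inputs "1,2" and "3   ,\n\t4" both yield two conversions, whereas "1 2" stops after the first conversion because the comma is missing.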

Successive specifications consume additional arguments, and it is the caller's responsibility to ensure that the counts and datatypes of arguments and specifications match.

All arguments after the format string must be pointers, since they represent locations into which data are written. If there are too few arguments, or if they are of the wrong type, the user program may fail because of addressing errors, or produce erroneous conversions. Excess arguments are silently ignored.
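As a sketch of the pointer requirement, the following hypothetical function (parse_person is not part of any library) passes an address for every converted item, using only Standard C features.

```c
#include <stdio.h>

/* Read "name age height" from text.  Every argument after the format is
   an address: the caller passes &age and &height for scalars, while a
   char array already decays to a pointer.  The name buffer must hold at
   least 32 bytes, matching the width 31 plus the trailing NUL. */
int parse_person(const char *text, char name[32], int *age, double *height)
{
    return sscanf(text, "%31s %d %lf", name, age, height);
}
```

A caller would typically compare the return value with 3 before using any of the stored values.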

A specification begins with a percent (%) character, and is followed by optional flags, an optional unsigned number, an optional datatype specifier, and finally, a single-character conversion specifier.

The format specification follows one of these templates:

%{flags}{datatype}[AaBbcdEeFfGginopQqsuXx%@]
%{flags}w{datatype}[AaBbcdEeFfGginopQqsuXx%@]

Braces indicate a string of zero or more characters, and the braces are not included in the specification. Brackets indicate a set of characters, exactly one of which must be chosen.

Only a single flag character is recognized by the scanf() family:

*
Convert an input value, but suppress its assignment to an argument, and do not count it in the function return value.

Flag repetitions are permitted, but carry no additional meaning.
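A short sketch of assignment suppression, using Standard C sscanf() and a hypothetical function name (second_int):

```c
#include <stdio.h>

/* Skip the first integer and store only the second.  The suppressed
   field is converted but not assigned, and it is not counted in the
   return value, so a fully successful call returns 1, not 2. */
int second_int(const char *text, int *value)
{
    return sscanf(text, "%*d %d", value);
}
```

For the input "10 20", the call returns 1 and stores 20.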

The w field is the maximum input field width. An omitted width value means that there is no limit on the size of the input field.
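The effect of a maximum field width can be sketched with an undelimited date string. The function name parse_ddmmyyyy is hypothetical, and only Standard C sscanf() is used.

```c
#include <stdio.h>

/* Split an undelimited DDMMYYYY date.  Each width caps how many input
   characters the corresponding %d conversion may consume. */
int parse_ddmmyyyy(const char *text, int *day, int *month, int *year)
{
    return sscanf(text, "%2d%2d%4d", day, month, year);
}
```

For the input "29012007", the three conversions store 29, 1, and 2007.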

Leading zeros in the width are interpreted as decimal digits; they do not start an octal integer as they do elsewhere in the C-language family.

Adjacent digits in input numeric values in any supported number base may be separated by a single underscore, which does not affect the value. Thus, the input strings 3.141592653589793 and 3.141_592_653_589_793 specify the same approximation to pi.
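Standard C input functions do not accept digit separators, so the following is only a rough emulation for decimal input, not the mathcw implementation. The helper strtod_with_separators is hypothetical: it removes each underscore that lies between two decimal digits and then calls Standard C strtod().

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of mathcw or Standard C: copy text into
   buf, dropping each underscore that sits between two decimal digits,
   then convert the result with strtod().  Returns 0.0 if buf is too
   small to hold the copy. */
double strtod_with_separators(const char *text, char *buf, size_t bufsize)
{
    size_t n = strlen(text);
    size_t i, j = 0;

    if (n + 1 > bufsize)
        return 0.0;
    for (i = 0; i < n; i++) {
        int between_digits = text[i] == '_' && i > 0 && i + 1 < n
            && isdigit((unsigned char)text[i - 1])
            && isdigit((unsigned char)text[i + 1]);
        if (!between_digits)
            buf[j++] = text[i];
    }
    buf[j] = '\0';
    return strtod(buf, NULL);
}
```

With this sketch, "3.141_592_653_589_793" and "3.141592653589793" convert to the same double.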

For based-number conversion with %@, a base out of the range 2 ... 36 is an error. Bases larger than 10 use successive letters of the modern English alphabet, just as hexadecimal notation uses the additional letters a ... f, and lettercase is ignored.
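The %@ conversion is a mathcw extension and is not exercised here. The same digit-letter convention for bases 2 to 36 is, however, used by Standard C strtol(), so it can be illustrated with that function. The wrapper name parse_in_base and its -1 error value are assumptions made for this sketch.

```c
#include <stdlib.h>

/* Convert digits in the given base with Standard C strtol().  Letters
   stand for digit values 10 to 35, in either lettercase.  Returns -1
   for a base outside the range 2 to 36. */
long parse_in_base(const char *digits, int base)
{
    if (base < 2 || base > 36)
        return -1;
    return strtol(digits, NULL, base);
}
```

For example, "ff" and "FF" in base 16, "377" in base 8, and "11111111" in base 2 all produce 255, and "zz" in base 36 produces 35*36 + 35 = 1295.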

The exponent of based numbers is always a power of the base. By contrast, for binary, octal, and hexadecimal floating-point formats, the exponent is a power of two. Thus, the decimal value 255.0 can be written equivalently as 0x1.fep+7, 0xffp+0, 0o1.774p+7, 0b1.1111111p+7, and in based-number form as 16@f.f@e+1, 16@ff@e+0, 8@3.77@e+2, 2@1.1111111@e+7, and so on. While C89 supports only the decimal form, C99 also allows the hexadecimal form. However, the hoc(1) language recognizes all of them.
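Of the forms listed above, only the decimal and hexadecimal ones are Standard C, so this sketch is limited to them and assumes a C99-conforming library. The function name parse_hex_or_decimal is hypothetical.

```c
#include <stdio.h>

/* Convert a floating-point string with %lf.  A C99 library accepts both
   decimal input and hexadecimal input of the form 0xh.hhhp+nn, where the
   exponent is a power of two:
   0x1.fep+7 = (1 + 15/16 + 14/256) * 2^7 = 255. */
int parse_hex_or_decimal(const char *text, double *value)
{
    return sscanf(text, "%lf", value);
}
```

The inputs "255.0", "0xffp+0", and "0x1.fep+7" all store the value 255.0.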

The datatypes for integer argument conversion are:

hh
signed or unsigned char [C99 extension];
h
signed or unsigned short int;
l (ell)
signed or unsigned long int;
ll
signed or unsigned long long int [C99 extension];
j
intmax_t or uintmax_t [C99 extension];
z
size_t [C99 extension];
t
ptrdiff_t [C99 extension].

The datatype specifiers are mandatory when arguments do not point to objects of the default types (int, float, and char), since they determine the amount of data stored via the pointer arguments.
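The pairing of integer datatype specifiers with argument types can be sketched as follows, assuming a C99 library. The function name parse_sized is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Each length modifier must agree with the pointed-to type, because it
   tells the library how many bytes to store through the pointer:
   hh = char, h = short, l = long, ll = long long, j = intmax_t,
   z = size_t. */
int parse_sized(const char *text, signed char *c, short *s, long *l,
                long long *ll, intmax_t *j, size_t *z)
{
    return sscanf(text, "%hhd %hd %ld %lld %jd %zu", c, s, l, ll, j, z);
}
```

A mismatch, such as %d with a pointer to short, is not diagnosed by the library and can overwrite adjacent storage.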

The datatypes for binary floating-point argument conversion are:

l
double;
L
long double;
LL
long_long_double [mathcw extension].
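A sketch of the Standard C subset of these specifiers follows; the LL specifier is a mathcw extension and is not shown. The function name parse_three_widths is hypothetical.

```c
#include <stdio.h>

/* With no modifier, %f stores a float; l selects double, and L selects
   long double. */
int parse_three_widths(const char *text, float *f, double *d, long double *ld)
{
    return sscanf(text, "%f %lf %Lf", f, d, ld);
}
```

For the input "1.5 2.25 3.125", all three values are exactly representable, so each stored result equals the corresponding literal.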

The datatypes for decimal floating-point argument conversion are:

H
decimal_float [N1176 extension];
DD
decimal_double [N1176 extension];
DL
decimal_long_double [N1176 extension];
DLL
decimal_long_long_double [mathcw extension].

The H, DD, and DL datatypes follow the proposal ISO/IEC JTC1 SC22 WG14 N1176 Extension for the programming language C to support decimal floating-point arithmetic.

Integer conversions behave like those performed by the strtol(3) and strtoul(3) function family. The conversion types for integers are:

d
optionally-signed decimal;
i
optionally-signed decimal, or a number base determined by the input representation:
  • leading 0b or 0B for binary [mathcw extension];
  • leading 0 for octal;
  • leading 0x or 0X for hexadecimal.
n
write the current input character count into the next argument, which must be of type int *, and is not counted in the function return value;
o
optionally-signed octal;
u
optionally-signed decimal;
x
optionally-signed hexadecimal;
X
same as x conversion;
y
optionally-signed binary [mathcw extension];
Y
same as y conversion [mathcw extension].
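The %i and %n conversions can be sketched with Standard C sscanf(); the 0b prefix and the %y conversion are mathcw extensions and are not exercised. Both function names are hypothetical.

```c
#include <stdio.h>

/* %i picks the base from the input prefix: 0x or 0X means hexadecimal,
   a leading 0 means octal, and anything else is decimal. */
int parse_any_base(const char *text, int *value)
{
    return sscanf(text, "%i", value);
}

/* %n stores the number of input characters consumed so far.  It is not
   counted in the return value, so a successful call here returns 1. */
int parse_with_length(const char *text, int *value, int *consumed)
{
    return sscanf(text, "%d%n", value, consumed);
}
```

With parse_any_base(), the inputs "0x1f", "017", and "42" store 31, 15, and 42. With parse_with_length(), the input "123abc" stores 123 and a consumed count of 3.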

The conversion type for pointers is:

p
void *. Standard C defines the output form to be implementation dependent. In the mathcw library, %#x-style is used for modern 32-bit and 64-bit systems, while for the PDP-10, it follows tradition with %06o,,%06o, and on the PDP-11, with %0o. Platform-specific conventions may be provided for other systems as well.

If the input item for %p conversion was produced as output by the printf(3CW) family earlier in the same program execution, it refers to the same value. Otherwise, its meaning is undefined.
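The round-trip property for %p can be sketched with Standard C snprintf() and sscanf(). The function name pointer_round_trips and the 64-byte buffer size are assumptions made for this example.

```c
#include <stdio.h>

/* Print a pointer with %p, read the text back with %p, and report
   whether the recovered pointer compares equal to the original
   (1 if so, 0 otherwise). */
int pointer_round_trips(void *original)
{
    char text[64];
    void *recovered = NULL;

    snprintf(text, sizeof text, "%p", original);
    if (sscanf(text, "%p", &recovered) != 1)
        return 0;
    return recovered == original;
}
```

The guarantee applies only within a single program execution; text saved from an earlier run has no defined meaning.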

The conversion types for strings are:

%
Literal percent character. This is the only conversion type for which no argument can be consumed. In C89, flags, field widths, and datatype are normally omitted, so the format specification is usually written as %%. C99 permits only that form.
c
unsigned char, or with the l (ell) modifier, unsigned wchar_t. Without skipping leading whitespace, match exactly as many characters as the field width, or one character if a field width is not specified, and if the assignment-suppression flag is absent, store them in the area pointed to by the current argument. Do not append a trailing NUL.
s
signed or unsigned char *, or with the l (ell) modifier, wchar_t *. Any leading whitespace is skipped, and then the next consecutive non-whitespace characters are collected and, if the assignment-suppression flag is absent, stored with a trailing NUL in the area pointed to by the current argument.

A maximum field width should always be specified to limit the number of characters stored in the area pointed to by the current argument.

[...]
signed or unsigned char *, or with the l (ell) modifier, wchar_t *. Without skipping leading whitespace, match the longest nonempty sequence of characters belonging to the specified scanset (the characters between the brackets in the format specification), and if the assignment-suppression flag is absent, store them, and a trailing NUL, in the storage area pointed to by the current argument. The number of input characters converted is at most the maximum field width, which does not count the extra space for the trailing NUL.

A maximum field width should always be specified to limit the number of characters stored in the area pointed to by the current argument.

The scanset is a sequence of one or more characters. If the initial character after the left bracket is a caret (^), then the scanset is the complement of the specified sequence, that is, all characters except those specified. If the first character following the left bracket, or left bracket and caret, is a right bracket, it is part of the set, rather than ending the scanset.

Except for special handling at the beginning of the scanset, no significance is attached to characters in the scanset, or to their order, or their repetition; in particular, hyphens in the scanset do not imply character ranges, as they do in bracketed regular-expression patterns.

For example, %6[01234567] matches a sequence of one to six octal digits, %[^aeiouAEIOU] matches a sequence of one or more characters that are not English vowels, %[AAA] matches a sequence of one or more A characters, %[][0123456789] matches a sequence of one or more decimal digits and brackets, and %[]] matches a sequence of one or more right brackets.
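The string conversions described above can be sketched with Standard C sscanf(). The three function names are hypothetical, and each buffer is sized as the field width plus one where a trailing NUL is stored.

```c
#include <stdio.h>

/* Read one whitespace-delimited word into word[16].  The width 15
   leaves room for the trailing NUL that %s appends. */
int read_word(const char *text, char word[16])
{
    return sscanf(text, "%15s", word);
}

/* Read up to six octal digits into digits[7] using a scanset. */
int read_octal_digits(const char *text, char digits[7])
{
    return sscanf(text, "%6[01234567]", digits);
}

/* Read exactly three characters with %3c.  Whitespace is not skipped
   and no NUL is appended, so the buffer is terminated by hand. */
int read_three_chars(const char *text, char out[4])
{
    int n = sscanf(text, "%3c", out);

    out[3] = '\0';
    return n;
}
```

For example, read_word() stores only the first 15 characters of a longer word, read_octal_digits() stores "123456" for the input "1234567890" and returns 0 for the input "89", and read_three_chars() keeps the leading blank of the input " ab".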

Floating-point conversions behave like those performed by the strtod(3) function family, although the mathcw library recognizes additional formats. The conversion types for floating-point values are:

a
optionally-signed floating-point number in any of these forms:
  • decimal (-d.ddd...e+nn or -d.ddd...e+NN or d.ddd...);
  • hexadecimal (-0xh.hhh...p+nn or -0Xh.hhh...P+nn) [C99 extension];
  • binary (-0bh.hhh...p+nn or -0Bh.hhh...P+nn) [mathcw extension];
  • octal (-0od.ddd...p+nn or -0Od.ddd...P+nn) [mathcw extension];
  • based (-nn@d.ddd...@e+nn or -nn@d.ddd...@E+nn) [mathcw extension].

In addition to numeric values, the conversion recognizes optionally-signed strings inf and infinity for IEEE 754 Infinity and nan and nan(n-char-sequence) for IEEE 754 NaN (Not-a-Number). In all of these, lettercase is ignored. The meaning of the n-char-sequence is unspecified, apart from requiring balanced parentheses. In the mathcw implementation, it is ignored, and the stored input is a quiet NaN. NaN may be signed, but its sign is platform-dependent, and without meaning in IEEE 754 floating-point arithmetic. If the underlying arithmetic system does not support Infinity and NaN, then both produce a signed magnitude with the largest representable floating-point number.
TO DO: Ensure this guarantee for non-IEEE-754 systems.

The mathcw implementation also recognizes qnan and qnan(n-char-sequence) for quiet NaNs, and snan and snan(n-char-sequence) for signaling NaNs. Note, however, that some early implementations of IEEE 754 arithmetic, such as the Intel IA-32 architecture, support only one kind of NaN; in such a case, all input NaN representations are treated as that single kind.
TO DO: Recognize QNaN and SNaN.

A
same as %a conversion;
b
same as %a conversion [mathcw extension];
B
same as %a conversion [mathcw extension];
e
same as %a conversion;
E
same as %a conversion;
f
same as %a conversion;
F
same as %a conversion;
g
same as %a conversion;
G
same as %a conversion;
q
same as %a conversion [mathcw extension];
Q
same as %a conversion [mathcw extension];
@
same as %a conversion.
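Recognition of Infinity and NaN input can be sketched with a C99 library on an IEEE 754 system; the qnan and snan forms are mathcw extensions and are not exercised. The function name classify_input and its character return codes are assumptions made for this example.

```c
#include <math.h>
#include <stdio.h>

/* Classify a floating-point input string: returns 'i' for an infinity,
   'n' for a NaN, 'f' for a finite value, or '?' if nothing converts. */
int classify_input(const char *text)
{
    double value;

    if (sscanf(text, "%lf", &value) != 1)
        return '?';
    if (isinf(value))
        return 'i';
    if (isnan(value))
        return 'n';
    return 'f';
}
```

Because lettercase is ignored, "inf", "-Infinity", and "NaN" are all recognized, while "abc" fails to convert.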

IMPLEMENTATION LIMITS

TO DO: are there any input limits?

IMPLEMENTATION-DEFINED BEHAVIOR

There are several locations in the descriptions of the scanf() function family where the ISO C Standards leave behavior unspecified, or declare it to be implementation defined. Such imprecision is a barrier to portability, since user code that exploits the behavior of one particular library implementation is likely to misbehave, or fail, when linked with another implementation, either on the same system (such as might happen by choosing a different compiler), or on a different system.

Here is a list of those areas in order of their appearance in Technical Corrigendum 2 of the 1999 ISO C Standard, each with a statement of how the mathcw library implementation behaves:

In the event of an error report on stderr, control immediately returns to the caller of the scanf() family routine with an error code of EOF (-1). The remainder of the format string, and all remaining arguments, are left unprocessed. However, the part of the specification before the error has already been processed, and may have consumed input.

In the mathcw library, all functions in the scanf() family are short wrappers that call a common internal function to handle the format scanning and argument processing. They are thus guaranteed to behave identically, apart from where their input is read. This has not been true historically of other implementations on some systems, because the family members were introduced at different times, and may have different code.
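The wrapper design can be sketched in Standard C, where a variadic function forwards its arguments to the corresponding va_list function. This is not the mathcw source; the name my_sscanf is hypothetical, and vsscanf() stands in for the common internal function.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical wrapper: behaves like sscanf() by forwarding its
   variable arguments to vsscanf(), which does the actual scanning. */
int my_sscanf(const char *s, const char *format, ...)
{
    va_list args;
    int count;

    va_start(args, format);
    count = vsscanf(s, format, args);
    va_end(args);
    return count;
}
```

Because all of the scanning happens in one place, a wrapper of this kind cannot drift out of step with the other family members.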


SECURITY ISSUES

There are some significant security issues with the functions in the scanf() family:

RETURN VALUES

Functions in the scanf() family return the number of items converted (that is, the number of arguments assigned to, excluding %n conversions), or EOF (-1) on error.
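The possible return values can be sketched with Standard C sscanf(). The function name count_ints is hypothetical.

```c
#include <stdio.h>

/* Return how many of three requested integers were converted, or EOF
   if the input ended before the first conversion. */
int count_ints(const char *text)
{
    int a, b, c;

    return sscanf(text, "%d %d %d", &a, &b, &c);
}
```

The input "1 2 3" returns 3, "1 2 x" returns 2, "x" returns 0, and the empty string returns EOF. Callers should therefore compare the result with the expected count rather than test only for EOF.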

ERRORS

For file input, filesystem errors, such as unreadable storage blocks, or empty or truncated files, can result in an immediate return of EOF at any input character.

For both file and string input, a return of EOF can happen if an erroneous conversion specification is encountered. In such a case, an error message and the faulty specification are reported on stderr.


SEE ALSO

cvtib(3CW), cvtid(3CW), cvtig(3CW), cvtih(3CW), cvtio(3CW), fclose(3), fgetc(3), fopen(3), fprintf(3CW), fputc(3), fputs(3), fread(3), fwrite(3), getc(3), getchar(3), getw(3), hoc(1), isspace(3), its4(1), printf(3CW), putc(3), putchar(3), puts(3), rats(1), snprintf(3CW), splint(1), sprintf(3CW), strtod(3), strtol(3), strtoul(3), ungetc(3), ungetwc(3), vfprintf(3CW), vprintf(3CW), vsnprintf(3CW), vsprintf(3CW).