stats::tabulate -- statistics of duplicate rows
stats::tabulate(s) eliminates duplicate rows in the sample s and appends a column containing the multiplicities.
stats::tabulate(s, c1, c2, ..., f) combines all rows that are identical except for entries in the specified columns c1, c2 etc. The function f is applied to these columns; its result replaces the values in these columns.
stats::tabulate(s, [c1, f1], [c2, f2], ...) combines all rows that are identical except for entries in the columns c1, c2 etc. The functions f1, f2 etc. are applied to these columns; the results replace the values in these columns.
stats::tabulate(s)
stats::tabulate(s, c1, c2... <, f>)
stats::tabulate(s, c1..c2, c3..c4... <, f>)
stats::tabulate(s, [c1, f1], [c2, f2]...)
stats::tabulate(s, [c1, c2..., f1], [c3, c4..., f2]...)
s               - a sample of domain type stats::sample
c1, c2, ...     - integers representing column indices of the sample s
f, f1, f2, ...  - procedures
Return value: a sample of domain type stats::sample.
stats::tabulate regards rows as duplicates if they have identical entries in the columns that are not specified.
With stats::tabulate(s, c1, c2, ..., f), the function f is applied to the entries of the duplicate rows in the specified columns. Duplicates are eliminated and replaced by a single instance of the row; the result of f is inserted into the corresponding columns.
The function f must accept as many parameters as there are duplicates. Typical applications involve functions such as stats::mean, which accept arbitrarily many arguments.
E.g., with stats::mean, duplicate rows are replaced by a single row in which the entries of the columns c1, c2 etc. are replaced by the mean values of the corresponding entries of the duplicates.
If no function f is specified, then the default function _plus is used.
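The grouping rule described above can be modelled outside MuPAD. The following is a minimal Python sketch, not MuPAD's implementation: the names tabulate and mean are hypothetical, a plain list of lists stands in for stats::sample, a summing function stands in for _plus, and keeping groups in first-seen order is an assumption.

```python
from fractions import Fraction

def tabulate(rows, cols, f=None):
    """Group rows by the columns NOT listed in cols (1-based indices) and
    replace each listed column by f applied to the grouped entries."""
    if f is None:
        f = lambda *xs: sum(xs)      # stands in for the default _plus
    idx = [c - 1 for c in cols]
    groups = {}
    for row in rows:
        # Rows are duplicates if they agree in all unspecified columns.
        key = tuple(v for i, v in enumerate(row) if i not in idx)
        groups.setdefault(key, []).append(row)
    out = []
    for members in groups.values():
        merged = list(members[0])
        for i in idx:
            # f receives one argument per duplicate row.
            merged[i] = f(*(m[i] for m in members))
        out.append(merged)
    return out

def mean(*xs):
    """Exact mean of integer arguments, accepting arbitrarily many."""
    return Fraction(sum(xs), len(xs))

rows = [["a", "A", 1], ["a", "A", 1], ["a", "A", 2], ["b", "B", 5], ["b", "B", 10]]
means = tabulate(rows, [3], mean)   # mean of column 3 per group
sums = tabulate(rows, [3])          # default: sum of column 3 per group
```

Because f is called with one argument per duplicate, it has to be variadic, which is why functions of the stats::mean kind fit this interface.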
If column indices are specified more than once, then extra columns with the result of the specified function are inserted into the sample.
stats::tabulate(s, c1..c2, ..., f) is shorthand notation for stats::tabulate(s, c1, c1+1, ..., c2, ..., f).
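The range shorthand amounts to expanding each range into its consecutive column indices before anything else happens. A small Python sketch of that expansion, assuming a range c1..c2 is written as the tuple (c1, c2); the function name expand_columns is hypothetical:

```python
def expand_columns(specs):
    """Expand (c1, c2) range tuples into c1, c1+1, ..., c2; keep plain
    integers unchanged."""
    cols = []
    for spec in specs:
        if isinstance(spec, tuple):
            lo, hi = spec
            cols.extend(range(lo, hi + 1))   # both ends inclusive
        else:
            cols.append(spec)
    return cols

expanded = expand_columns([(2, 3), 5])
```

Here expanded is the flat index list that the long form of the call would have named explicitly.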
With stats::tabulate(s, [c1, f1], [c2, f2], ...), pairs of columns and corresponding procedures are specified. Again, rows are regarded as duplicates if they have identical entries in the columns that are not specified. Duplicates are eliminated and replaced by a single instance of the row; the result of f1 is inserted in column c1, the result of f2 is inserted in column c2, etc.
If column indices are specified more than once, then extra columns with the result of the specified functions are inserted into the sample.
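The pair form, including a column index that appears more than once, can be modelled as follows. This is a hedged Python sketch rather than MuPAD's implementation: tabulate_pairs, mean and pstdev are hypothetical names, a list of lists stands in for stats::sample, and placing the extra columns directly next to each other at the position of the original column is an assumption. pstdev is the population standard deviation, which is what the values printed in the examples on this page are consistent with.

```python
import math
from fractions import Fraction

def mean(*xs):
    return Fraction(sum(xs), len(xs))

def pstdev(*xs):
    """Population standard deviation (divides by n, not n - 1)."""
    m = sum(xs) / len(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

def tabulate_pairs(rows, pairs):
    """pairs is a list of (column, function) with 1-based column indices."""
    cols = {c - 1 for c, _ in pairs}
    groups = {}
    for row in rows:
        key = tuple(v for i, v in enumerate(row) if i not in cols)
        groups.setdefault(key, []).append(row)
    out = []
    for members in groups.values():
        merged = []
        for i, value in enumerate(members[0]):
            if i in cols:
                # One result column per pair naming this column, so a
                # repeated index yields extra columns (assumed placement).
                merged.extend(f(*(m[i] for m in members))
                              for c, f in pairs if c - 1 == i)
            else:
                merged.append(value)
        out.append(merged)
    return out

rows = [["a", "A", 1], ["a", "A", 1], ["a", "A", 2], ["b", "B", 5], ["b", "B", 10]]
stats_rows = tabulate_pairs(rows, [(3, mean), (3, pstdev)])
```

Each output row then carries the two grouping entries followed by the mean and the standard deviation of the third column.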
With stats::tabulate(s, [c1, c2, ..., f1], ...), it is possible to apply functions that act on several columns. The procedure f1 has to accept a sequence of lists (each representing a column). The specified columns are replaced by a single column containing the result of f1. If column indices are specified more than once, then extra columns with the result of the specified function(s) are inserted into the sample. Cf. Examples 2 and 3.
Example 1
We create a sample:
>> s := stats::sample([[a, A, 1], [a, A, 1], [a, A, 2], [b, B, 5], [b, B, 10]])
  a  A   1
  a  A   1
  a  A   2
  b  B   5
  b  B  10
Duplicate rows of the sample are counted. There are four unique rows, one occurring twice:
>> stats::tabulate(s)
  a  A   1  2
  a  A   2  1
  b  B   5  1
  b  B  10  1
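The counting behaviour shown above can be reproduced with a short Python model. It is only a sketch under stated assumptions: tabulate_counts is a hypothetical name, a list of lists stands in for stats::sample, strings stand in for the symbolic entries, and emitting rows in first-seen order is an assumption about ordering.

```python
from collections import Counter

def tabulate_counts(rows):
    """Collapse identical rows and append each row's multiplicity."""
    # Counter keeps keys in insertion order, i.e. first-seen order.
    counts = Counter(tuple(r) for r in rows)
    return [list(row) + [n] for row, n in counts.items()]

rows = [["a", "A", 1], ["a", "A", 1], ["a", "A", 2], ["b", "B", 5], ["b", "B", 10]]
result = tabulate_counts(rows)
```

For this input, result has four rows, and only the row a, A, 1 has multiplicity 2.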
In the following call, rows are regarded as duplicates if the entries in the first two columns coincide. We compute the mean value of the third entry of the duplicates:
>> stats::tabulate(s, 3, stats::mean)
  a  A   4/3
  b  B  15/2
We compute both the mean and the standard deviation of the data in the third column for the sub-samples labeled 'a A' and 'b B' by the first two columns:
>> stats::tabulate(s, [3, stats::mean], [3, stats::stdev])
  a  A   4/3  1/3*2^(1/2)
  b  B  15/2          5/2
>> delete s:
Example 2
We create a sample containing columns for "gender", "age" and "size":
>> s := stats::sample([["f", 25, 166], ["m", 30, 180], ["f", 54, 160], ["m", 40, 170], ["f", 34, 170], ["m", 20, 172]])
  "f"  25  166
  "m"  30  180
  "f"  54  160
  "m"  40  170
  "f"  34  170
  "m"  20  172
We use stats::mean on the second and third columns to calculate the average "age" and "size" of each gender:
>> stats::tabulate(s, 2..3, float@stats::mean)
  "f"  37.66666667  165.3333333
  "m"         30.0        174.0
With the next call, both the mean and the standard deviation of "age" and "size" for each gender are inserted into the sample:
>> stats::tabulate(s, [2, float@stats::mean], [2, float@stats::stdev], [3, float@stats::mean], [3, float@stats::stdev])
  "f"  37.66666667  12.11977264  165.3333333  4.109609335
  "m"         30.0  8.164965809        174.0  4.320493799
We compute the Bravais-Pearson correlation coefficient between "age" and "size" for each gender:
>> stats::tabulate(s, [2, 3, float@stats::BPCorr])
  "f"  -0.7540135992
  "m"  -0.1889822365
>> delete s:
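The multi-column form used for the correlation coefficient can be modelled in Python as below. This is a sketch, not MuPAD's implementation: tabulate_multi and pearson are hypothetical names, a list of lists stands in for stats::sample, and the result is simply appended after the grouping columns, which is a simplification of where the combined column is placed.

```python
import math

def pearson(xs, ys):
    """Bravais-Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def tabulate_multi(rows, cols, f):
    """Replace the listed columns (1-based) by one column f(col_1, col_2, ...),
    where each argument is the list of that column's entries in the group."""
    idx = [c - 1 for c in cols]
    groups = {}
    for row in rows:
        key = tuple(v for i, v in enumerate(row) if i not in idx)
        groups.setdefault(key, []).append(row)
    out = []
    for key, members in groups.items():
        columns = [[m[i] for m in members] for i in idx]
        out.append(list(key) + [f(*columns)])   # simplified placement
    return out

rows = [["f", 25, 166], ["m", 30, 180], ["f", 54, 160],
        ["m", 40, 170], ["f", 34, 170], ["m", 20, 172]]
corr = tabulate_multi(rows, [2, 3], pearson)
```

Unlike the single-column case, f here receives whole columns as lists rather than one argument per duplicate row.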
Example 3
We create a sample:
>> s := stats::sample([[a, x1, 1, 2], [b, x2, 2, 4], [b, x1, 2, 4], [e, x2, 3, 5.5]])
  a  x1  1    2
  b  x2  2    4
  b  x1  2    4
  e  x2  3  5.5
We regard rows with the same entry in the second column as "of the same kind". We tabulate the sample using different functions on the remaining columns:
>> stats::tabulate(s, [1, _plus], [3, _mult], [4, stats::mean])
  a + b  x1  2     3
  b + e  x2  6  4.75
One can apply customized procedures. In the following, we define the procedure plusmult, which sums up the elements of each of two lists (representing columns) and then multiplies the two sums.
>> plusmult := proc(x, y) begin _plus(op(x))*_plus(op(y)) end_proc:
This procedure is then used to combine the first and the third column. Simultaneously, the mean and the standard deviation of the fourth column are inserted into the sample.
>> stats::tabulate(s, [1, 3, plusmult], [4, stats::mean], [4, stats::stdev])
  3*a + 3*b  x1     3     1
  5*b + 5*e  x2  4.75  0.75
>> delete plusmult, s: