The analysis of proportions is of two primary types.

- Focus on a single value of a categorical variable, termed a “success” when it occurs, for one or more samples of data. Analyze the resulting proportion of occurrence for a single sample, or compare proportions of successes across distinct data samples for a single variable, a test of *homogeneity*.
- Compare the obtained proportions across the values of one or more categorical variables for a single sample. Applied to a single variable, the analysis is a test of *goodness-of-fit*. Or, evaluate a potential relationship between two categorical variables, a test of *independence*.

Built from standard base R functions, the **lessR** function `Prop_test()`, abbreviated `prop()`, provides either type of analysis. To use, generally enter either the original data from which to compute the frequencies and then the sample proportions, or enter already computed frequencies. For the analysis of multiple categorical variables across two levels of one of the variables, the test of *homogeneity* and the test of *independence* yield the identical statistical result.

The following table summarizes the values of the `Prop_test()` parameters for different analyses of proportions. Each function call for the analysis of data begins with the name of a categorical variable, generically referred to as `X`. The value of `X` is the first parameter in the function definition, and so does not need its parameter name, `variable`. If needed, indicate a second categorical variable, generically referred to as `Y`, with the `by` parameter. If focused on a specific value of `X` as a success, referred to as `X_value`, indicate that value with the `success` parameter.

Run each analysis either directly from pre-computed values of the sample proportions, or from the original data from which the sample proportions are calculated.

Evaluate | Data Parameters | Count Parameters |
---|---|---|
A hypothesized proportion | X, `success`=X_value | `n_succ`, `n_tot` [scalars] |
Equal proportions across samples | X, `success`=X_value, `by`=Y | `n_succ`, `n_tot` [vectors] |
Uniform goodness-of-fit | X | `n_tot` [vector] |
Independence of two variables | X, `by`=Y | `n_table` |

The remainder of this vignette illustrates these applications of `Prop_test()`.

Refer to the occurrence of a designated value of the `variable` as a `success`. Define all other values of the variable as failures. Success or failure in this context does not necessarily mean good or bad, desired or undesired, but instead, a designated value either occurred or did not.

When analyzing proportions from data, first indicate the categorical variable, the value of the parameter `variable`. Next, indicate the value of `variable` that defines a success with the parameter `success`. When entering proportions directly, indicate the number of successes and the total number of trials with the `n_succ` and `n_tot` parameters. Enter the value of each parameter either as a single value for one sample or as a vector of multiple values for multiple samples. Without a value for `success` or `n_succ`, the analysis is of goodness-of-fit or independence.

The example below is from the documentation for the base R function `binom.test()`, which provides the exact test of a null hypothesis regarding the probability of success. `Prop_test()` uses that base R function to compare a sample proportion to a hypothesized population value.

For a given categorical variable of interest, a type of plant, consider two values, either “giant” or “dwarf”. From a sample of 925 plants, the specified value of “giant” occurred 682 times and did not occur 243 times. The null hypothesis tested is that the specified value occurs for 3/4 of the population, as specified with the `p0` parameter.

`Prop_test(n_succ=682, n_fail=243, p0=.75)`

```
##
## >>> Exact binomial test of a proportion <<<
##
## ------ Description ------
##
## Number of successes: 682
## Number of failures: 243
## Number of trials: 925
## Sample proportion: 0.737
##
## ------ Inference ------
##
## Hypothesis test for null of 0.75, p-value: 0.382
## 95% Confidence interval: 0.708 to 0.765
```
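Because `Prop_test()` relies on `binom.test()` for this analysis, the same result can be checked with base R alone. The following is a minimal sketch of that check, not the **lessR** implementation; the object name `bt` is only illustrative.

```r
# Exact binomial test from counts: 682 successes out of 925 trials,
# with a null hypothesis of a population proportion of 0.75.
# 'bt' is an illustrative name, not part of lessR.
bt <- binom.test(x = 682, n = 925, p = 0.75)

bt$estimate   # sample proportion, 682/925
bt$p.value    # two-sided p-value
bt$conf.int   # 95% confidence interval
```

Rounded to three decimal digits, these values should agree with the `Prop_test()` output above.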

To illustrate with data, read the *Jackets* data file included with **lessR** into the data frame *d*. The file contains two categorical variables. The variable *Bike* represents two different types of motorcycle: BMW and Honda. The second variable is *Jacket* with three values of jacket thickness: Lite, Med, and Thick. Because *d* is the default name of the data frame that contains the variables for analysis, the `data` parameter that names the input data frame need not be specified.

`d <- Read("Jackets")`

```
##
## >>> Suggestions
## Details about your data, Enter: details() for d, or details(name)
##
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## ------------------------------------------------------------
##
## Variable Missing Unique
## Name Type Values Values Values First and last values
## ------------------------------------------------------------------------------------------
## 1 Bike character 1025 0 2 BMW Honda Honda ... Honda Honda BMW
## 2 Jacket character 1025 0 3 Lite Lite Lite ... Lite Med Lite
## ------------------------------------------------------------------------------------------
```

In the following example, for the `variable` *Bike* from the default *d* data frame, define the parameter `success` as the value *“BMW”*. The default null hypothesis is a population value of 0.5, but here specify that value explicitly with the parameter `p0`.

For clarity, the following example includes the parameter names listed with their corresponding values. These names are unnecessary in this example, however, because the values are listed in the same order as the parameters in the definition of the `Prop_test()` function.

`Prop_test(variable=Bike, success="BMW", p0=0.5)`

```
##
## >>> Exact binomial test of a proportion <<<
##
## Variable: Bike
## success: BMW
##
## ------ Description ------
##
## Number of missing values: 0
## Number of successes: 418
## Number of failures: 607
## Number of trials: 1025
## Sample proportion: 0.408
##
## ------ Inference ------
##
## Hypothesis test for null of 0.5, p-value: 0.000
## 95% Confidence interval: 0.378 to 0.439
```

Reject the null hypothesis, with a \(p\)-value of 0.000 to three decimal digits, less than \(\alpha = 0.05\). The sample proportion \(p=0.408\) is considered far from the hypothesized value of \(0.5\) for the proportion of `"BMW"` values for *Bike*. Conclude that the data were sampled from a population with a population proportion of BMW different from 0.5.

The following example is from the base R `prop.test()` documentation, which the **lessR** `Prop_test()` relies upon to compare proportions across different groups.

The null hypothesis in this example is that the four populations of *patients* from which the samples were drawn have the same population proportion of *smokers*. The alternative is that at least one population proportion is different. Label the groups in the output by providing a named vector for the successes.

To indicate multiple proportions across groups, provide multiple values for the `n_succ` and `n_tot` parameters. Optionally, name the groups.

```
smokers <- c(83, 90, 129, 70)
names(smokers) <- c("Group1","Group2","Group3","Group4")
patients <- c(86, 93, 136, 82)
Prop_test(n_succ=smokers, n_tot=patients)
```

```
##
## >>> 4-sample test for equality of proportions without continuity correction <<<
##
##
## >>> Description
##
## Group1 Group2 Group3 Group4
## ----------- ------- ------- ------- -------
## n_ 83 90 129 70
## n_total 86 93 136 82
## proportion 0.965 0.968 0.949 0.854
##
## >>> Inference
##
## Chi-square statistic: 12.600
## Degrees of freedom: 3
## Hypothesis test of equal population proportions: p-value = 0.006
```

The result of the test is \(p\)-value \(=0.006 < \alpha=0.05\), so reject the null hypothesis of equal probabilities across the corresponding four populations. Conclude that at least one of the population proportions of smokers differs from the others.
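Since `Prop_test()` relies on `prop.test()` here, the inference can be reproduced with base R alone. A minimal sketch, with the illustrative object name `pt`:

```r
# Successes and trials for the four groups, as in the example above
smokers  <- c(Group1 = 83, Group2 = 90, Group3 = 129, Group4 = 70)
patients <- c(86, 93, 136, 82)

# Test of equal population proportions across the four groups
pt <- prop.test(x = smokers, n = patients)

round(smokers / patients, 3)   # sample proportions by group
pt$statistic                   # chi-square statistic
pt$parameter                   # degrees of freedom
pt$p.value                     # p-value
```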

The following example duplicates the results of the previous `Prop_test()` call, but this time from data. To illustrate, create the data frame *d* according to the proportions of smokers and non-smokers with respective values “smoke” and “nosmoke”. Of course, in actual data analysis the data would already be available.

```
sm1 <- c(rep("smoke", 83), rep("nosmoke", 3))
sm2 <- c(rep("smoke", 90), rep("nosmoke", 3))
sm3 <- c(rep("smoke", 129), rep("nosmoke", 7))
sm4 <- c(rep("smoke", 70), rep("nosmoke", 12))
sm <- c(sm1, sm2, sm3, sm4)
grp <- c(rep("A",86), rep("B",93), rep("C",136), rep("D",82))
d <- data.frame(sm, grp)
```

To test if the different groups have the same population proportion of `success`, retain the syntax for a single proportion for the categorical `variable` of interest. Define success by the value of this variable, here *“smoke”*. However, an additional parameter `by` indicates the variable that defines the groups, a variable that contains a label that identifies the corresponding group for each row of data. The grouping variable in this example is *grp*, with values the first four uppercase letters of the alphabet. The first six rows of data are shown below.

`head(d)`

```
## sm grp
## 1 smoke A
## 2 smoke A
## 3 smoke A
## 4 smoke A
## 5 smoke A
## 6 smoke A
```

The relevant parameters `variable`, `success`, and `by` are listed in their given order in this example, so the parameter names are unnecessary. The names are listed here for clarity.

`Prop_test(variable=sm, success="smoke", by=grp)`

```
##
## >>> 4-sample test for equality of proportions without continuity correction <<<
##
## Variable: sm
## success: smoke
## by: grp
##
## >>> Description
##
## A B C D
## ----------- ------ ------ ------ ------
## n_smoke 83 90 129 70
## n_total 86 93 136 82
## proportion 0.965 0.968 0.949 0.854
##
## >>> Inference
##
## Chi-square statistic: 12.600
## Degrees of freedom: 3
## Hypothesis test of equal population proportions: p-value = 0.006
```

The analysis of data that matches the previously input proportions, of course, provides the same results as providing the proportions directly.
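The reduction from rows of data to counts can be sketched in base R by cross-tabulating the grouping variable with the outcome and passing the resulting two-column table to `prop.test()`. This is only an illustration of the idea, not the internal code of `Prop_test()`; the names `d_sketch`, `counts`, and `pt_data` are hypothetical.

```r
# Rebuild data with the same structure as the data frame in the example
sm  <- rep(rep(c("smoke", "nosmoke"), times = 4),
           times = c(83, 3, 90, 3, 129, 7, 70, 12))
grp <- rep(c("A", "B", "C", "D"), times = c(86, 93, 136, 82))
d_sketch <- data.frame(sm, grp)

# Cross-tabulate group by outcome. table() orders columns alphabetically,
# so reorder to put the successes ("smoke") in the first column.
counts <- table(d_sketch$grp, d_sketch$sm)[, c("smoke", "nosmoke")]
counts

# prop.test() accepts a two-column table of successes and failures
pt_data <- prop.test(counts)
pt_data$statistic
pt_data$p.value
```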

For the previously discussed test of homogeneity of the values of a single categorical variable, the proportion of occurrences for a specific value across different samples is of interest. Here, instead calculate the proportion of occurrence for each value from the total number of occurrences, as one sample from a single population. In addition to the inference test, the following are also reported:

- The observed and expected frequencies
- The residual of expected from observed
- The standardized version of the residual

For the goodness-of-fit test to a uniform distribution, provide the frequencies for each group for the parameter `n_tot`. The default null hypothesis is that the proportions of the different categories of a categorical variable are equal.

In this example, enter three frequencies as a vector for the `n_tot` parameter value. Optionally, make the vector a named vector to label the output accordingly.

```
x = c(372, 342, 311)
names(x) = c("Lite", "Med", "Thick")
Prop_test(n_tot=x)
```

```
##
## >>> Chi-squared test for given probabilities <<<
##
##
## >>> Description
##
## Lite Med Thick
## --------- -------- -------- --------
## observed 372 342 311
## expected 341.667 341.667 341.667
## residual 1.641 0.018 -1.659
## stdn res 2.010 0.022 -2.032
##
## >>> Inference
##
## Chi-square statistic: 5.446
## Degrees of freedom: 2
## Hypothesis test of equal population proportions: p-value = 0.066
```

This example does not quite attain significance at the customary 5% level, with \(p\)-value \(= 0.066 > \alpha = 0.05\). A difference of the corresponding population proportions was not detected.
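The quantities in this output can be reproduced with the base R function `chisq.test()`, whose title, "Chi-squared test for given probabilities", matches the heading above. The sketch below assumes, without confirmation from the **lessR** source, that the reported residuals correspond to the Pearson and standardized residuals of `chisq.test()`; the object name `gof` is illustrative.

```r
# Observed frequencies for the three jacket types
x <- c(Lite = 372, Med = 342, Thick = 311)

# Goodness-of-fit test; the default null is equal probabilities
gof <- chisq.test(x)

gof$expected    # expected frequency per category, sum(x) / 3
gof$residuals   # Pearson residuals: (observed - expected) / sqrt(expected)
gof$stdres      # standardized residuals
gof$statistic   # chi-square statistic
gof$p.value     # p-value
```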

The same analysis follows from the data. Just specify the name of the categorical `variable` of interest.

`d <- Read("Jackets", quiet=TRUE)`

`Prop_test(Jacket)`

```
##
## >>> Chi-squared test for given probabilities <<<
##
## Variable: Jacket
##
## >>> Description
##
## Lite Med Thick
## --------- -------- -------- --------
## observed 372 342 311
## expected 341.667 341.667 341.667
## residual 1.641 0.018 -1.659
## stdn res 2.010 0.022 -2.032
##
## >>> Inference
##
## Chi-square statistic: 5.446
## Degrees of freedom: 2
## Hypothesis test of equal population proportions: p-value = 0.066
```

Tests of independence evaluated here rely upon a contingency table of two dimensions, also called a cross-tabulation table or joint frequency table. Enter the joint frequencies directly or compute them from the data. The corresponding analysis provides the chi-square test for the null hypothesis of independence.

Also provided is Cramer’s V to indicate the extent of the relationship of the two categorical variables. For each cell frequency, the expected value given the independence assumption is provided, along with the corresponding residual from the observed frequency and the corresponding standardized residual.

To enter the joint frequency table directly, store the frequencies in a file accessible from your computer system. One possibility is to enter the numbers into a text file with file type `.csv` or `.txt`. Enter the numbers with a text editor, or with a word processor saving the file as a text file. This file format separates the adjacent values in each row with a comma, as indicated below. Or, enter the numbers into an MS Excel formatted file with file type `.xlsx`. Enter only the numeric frequencies, no labels.

For example, consider the following joint frequency table with four levels of the column variable and four levels of the row variable, here in `csv` format.

```
3,58,6,105
41,79,9,207
86,179,27,484
143,214,31,824
```
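As an alternative to a text editor, the file can also be written from within R. The sketch below writes the four lines above to a temporary file and reads them back with base R only to confirm the layout: numbers only, no header, comma separated. The location given by `tempfile()` and the names `freq_lines`, `freq_file`, and `check` are assumptions for illustration.

```r
# The four rows of joint frequencies, one comma-separated line per row
freq_lines <- c("3,58,6,105",
                "41,79,9,207",
                "86,179,27,484",
                "143,214,31,824")

# Hypothetical location: a temporary file with a .csv extension
freq_file <- tempfile(fileext = ".csv")
writeLines(freq_lines, freq_file)

# Read back without a header row to verify the 4 x 4 numeric layout
check <- as.matrix(read.csv(freq_file, header = FALSE))
dim(check)   # rows, columns
sum(check)   # total number of observations
```

The path stored in `freq_file` is then the kind of path name to supply for the joint frequency table.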

After saving the file, call `Prop_test()` using the parameter `n_table` to indicate the path name to the file, enclosed in quotes. Or, leave the quotes empty to browse for the joint frequency table.

This table is included in a file downloaded with **lessR** with the name *FreqTable99*. That name triggers an internal process that locates the file within the **lessR** installation without needing to construct a rather complicated path name as part of this example. That also means that the name becomes a reserved keyword, with its use always triggering the following example.

In general, replace *FreqTable99* in this example with your own path name to your file of joint frequencies, or just delete the name leaving only the two quotes to indicate to browse for the file.

`Prop_test(n_table="FreqTable99")`

```
##
## >>> Pearson's Chi-squared test <<<
##
## >>> Description
##
## Cell Frequencies
## 3 58 6 105
## 41 79 9 207
## 86 179 27 484
## 143 214 31 824
##
## Cramer's V: 0.075
##
## Row Col Observed Expected Residual Stnd Res
## 1 1 3 18.812 -15.812 -4.003
## 1 2 58 36.522 21.478 4.150
## 1 3 6 5.030 0.970 0.455
## 1 4 105 111.635 -6.635 -1.098
## 2 1 41 36.750 4.250 0.799
## 2 2 79 71.346 7.654 1.098
## 2 3 9 9.827 -0.827 -0.288
## 2 4 207 218.077 -11.077 -1.361
## 3 1 86 84.875 1.125 0.156
## 3 2 179 164.776 14.224 1.504
## 3 3 27 22.696 4.304 1.105
## 3 4 484 503.654 -19.654 -1.781
## 4 1 143 132.562 10.438 1.339
## 4 2 214 257.356 -43.356 -4.246
## 4 3 31 35.447 -4.447 -1.057
## 4 4 824 786.635 37.365 3.135
##
## >>> Inference
##
## Chi-square statistic: 41.732
## Degrees of freedom: 9
## Hypothesis test of equal population proportions: p-value = 0.000
```
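The description part of this output can be approximated in base R. The sketch below applies `chisq.test()` to the same frequencies and computes Cramer's V from its usual definition, \(V = \sqrt{\chi^2 / (n \cdot \min(r-1, c-1))}\). That `Prop_test()` uses exactly this definition is an assumption, though it is consistent with the reported value of 0.075. The names `tbl`, `ind`, and `V` are illustrative.

```r
# Joint frequencies from the table above, entered row by row
tbl <- matrix(c(  3,  58,  6, 105,
                 41,  79,  9, 207,
                 86, 179, 27, 484,
                143, 214, 31, 824), nrow = 4, byrow = TRUE)

# Pearson chi-square test of independence
ind <- chisq.test(tbl)

# Cramer's V from its usual definition (assumed to match lessR)
n <- sum(tbl)             # total number of observations
k <- min(dim(tbl)) - 1    # min(rows - 1, columns - 1)
V <- sqrt(unname(ind$statistic) / (n * k))

round(ind$expected[1, ], 3)   # expected frequencies, first row
round(ind$stdres[1, ], 3)     # standardized residuals, first row
round(V, 3)                   # Cramer's V
```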

Do not have the path name to your file readily available? Then browse for the file. The following example is not run as it cannot run in this vignette.

`Prop_test(n_table="")`

The full path name for the file is provided as part of the output.

The \(\chi^2\) test of independence evaluated here applies to two categorical variables. The first categorical variable listed in this example is the value of the parameter `variable`, the first parameter in the function definition, so does not need the parameter name. The second categorical variable listed must include the parameter name `by`.

The question for the analysis is whether the observed frequencies of *Jacket* thickness and *Bike* ownership differ enough from the frequencies expected under the null hypothesis of independence to conclude that the variables are related.

`Prop_test(Jacket, by=Bike)`

```
## variable: Jacket
## by: Bike
##
## >>> Pearson's Chi-squared test <<<
##
## >>> Description
##
## Jacket
## Bike Lite Med Thick Sum
## BMW 89 135 194 418
## Honda 283 207 117 607
## Sum 372 342 311 1025
##
## Cramer's V: 0.319
##
## Row Col Observed Expected Residual Stnd Res
## 1 1 89 151.703 -62.703 -8.288
## 1 2 135 139.469 -4.469 -0.602
## 1 3 194 126.827 67.173 9.287
## 2 1 283 220.297 62.703 8.288
## 2 2 207 202.531 4.469 0.602
## 2 3 117 184.173 -67.173 -9.287
##
## >>> Inference
##
## Chi-square statistic: 104.083
## Degrees of freedom: 2
## Hypothesis test of equal population proportions: p-value = 0.000
```

The result of this test is that the \(p\)-value = 0.000 \(< \alpha=0.05\), so reject the null hypothesis of independence. Conclude that the type of *Bike* a person rides and the thickness of their *Jacket* are related.
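These frequencies also illustrate the equivalence noted at the start of this vignette: with two levels of one variable, the test of independence and the test of homogeneity give the identical statistic. A base R sketch, in which the matrix `tab` is entered by hand from the output above and the names `indep` and `homog` are illustrative:

```r
# Joint frequencies of Bike (rows) by Jacket (columns), from the output above
tab <- matrix(c( 89, 135, 194,
                283, 207, 117), nrow = 2, byrow = TRUE,
              dimnames = list(Bike   = c("BMW", "Honda"),
                              Jacket = c("Lite", "Med", "Thick")))

# Independence: Pearson chi-square test on the 2 x 3 table
indep <- chisq.test(tab)

# Homogeneity: equal proportion of BMW across the three jacket types
homog <- prop.test(x = tab["BMW", ], n = colSums(tab))

c(independence = unname(indep$statistic),
  homogeneity  = unname(homog$statistic))
```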

To visualize the relationship of the two variables, use the same function call syntax, but now to `BarChart()` instead of `Prop_test()`. The visualization is accompanied by the same \(\chi^2\) test of independence.

`BarChart(Jacket, by=Bike)`

```
## >>> Suggestions
## Plot(Jacket, Bike) # bubble plot
## BarChart(Jacket, by=Bike, horiz=TRUE) # horizontal bar chart
## BarChart(Jacket, fill="steelblue") # steelblue bars
##
##
## Joint and Marginal Frequencies
## ------------------------------
##
## Jacket
## Bike Lite Med Thick Sum
## BMW 89 135 194 418
## Honda 283 207 117 607
## Sum 372 342 311 1025
##
##
## Cramer's V: 0.319
##
## Chi-square Test: Chisq = 104.083, df = 2, p-value = 0.000
```

The visualization depicts the relationship between motorcycle and jacket: Honda riders prefer thinner jackets, and BMW riders prefer thicker jackets. To speculate, perhaps because the BMW bikes are sportier, their riders are more concerned with going down on the pavement.

This relationship becomes even clearer with the corresponding 100% stacked bar graph. Each bar representing a jacket choice in this visualization shows the percentage of riders with each type of motorcycle for that jacket.

`BarChart(Jacket, by=Bike, stack100=TRUE)`

```
## >>> Suggestions
## Plot(Jacket, Bike) # bubble plot
## BarChart(Jacket, by=Bike, horiz=TRUE) # horizontal bar chart
## BarChart(Jacket, fill="steelblue") # steelblue bars
##
##
## Joint and Marginal Frequencies
## ------------------------------
##
## Jacket
## Bike Lite Med Thick Sum
## BMW 89 135 194 418
## Honda 283 207 117 607
## Sum 372 342 311 1025
##
##
## Cramer's V: 0.319
##
## Chi-square Test: Chisq = 104.083, df = 2, p-value = 0.000
##
##
## Cell Proportions within Each Column
## -----------------------------------
##
## Jacket
## Bike Lite Med Thick
## BMW 0.239 0.395 0.624
## Honda 0.761 0.605 0.376
## Sum 1.000 1.000 1.000
```

From this visualization we see that 24% of Lite jacket owners are BMW riders, and, in contrast, 62% of the owners of Thick jackets are BMW riders.
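The column proportions behind the 100% stacked bars can be sketched in base R with `prop.table()`. The matrix `tab` is entered by hand from the joint frequencies above, and the name `col_prop` is illustrative.

```r
# Joint frequencies of Bike (rows) by Jacket (columns), from the output above
tab <- matrix(c( 89, 135, 194,
                283, 207, 117), nrow = 2, byrow = TRUE,
              dimnames = list(Bike   = c("BMW", "Honda"),
                              Jacket = c("Lite", "Med", "Thick")))

# Proportions within each column: margin = 2 divides by the column sums
col_prop <- prop.table(tab, margin = 2)
round(col_prop, 3)
```

Passing `col_prop` to the base R function `barplot()` would draw a comparable stacked chart, with each bar summing to 1.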