## Data Analysis

The nature of environmental monitoring data presents a variety of challenges for statistical analysis. Foremost among these is the log-normal distribution of the data: a large number of values lie near zero, while a few values are much higher than the rest. Because these high values are typically of significant interest and utility, they were retained in the dataset. Such extreme values can bias the results of some statistical procedures, but the median (the midpoint of an ordered list of values, with 50 percent below and 50 percent above) is a relatively unbiased measure of central tendency and was employed in some of the trend analyses. In addition, transforming the data to logarithms can offset the bias to a degree.
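As an illustration of why the median and the log transform are less sensitive to extreme values, the following minimal sketch compares the mean, the median, and the back-transformed mean of the logarithms. The values are hypothetical and are not drawn from the report's dataset.

```python
import math
import statistics

# Hypothetical concentrations: mostly low values with one extreme value.
values = [1.0, 2.0, 2.0, 3.0, 4.0, 5.0, 400.0]

mean_value = statistics.mean(values)      # pulled upward by the extreme value
median_value = statistics.median(values)  # midpoint of the ordered values

# The natural-log transform compresses the high tail of the distribution.
log_values = [math.log(v) for v in values]

# Back-transforming the mean of the logs gives the geometric mean.
geometric_mean = math.exp(statistics.mean(log_values))

print(f"mean={mean_value:.2f} median={median_value:.2f} "
      f"geometric mean={geometric_mean:.2f}")
```

In this example the single high value dominates the arithmetic mean, while the median and the geometric mean stay close to the bulk of the data.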

Censored data (non-detects) pose another difficulty. Environmental data analysts typically convert the censored value to some multiple of the reporting limit (the numeric portion of the "<" statement). Treating the value as one-half the reporting limit is most common. However, if the dataset includes values with differing reporting limits (due to changes in methodology, for example), the substituted values shift with the reporting limit rather than with actual concentrations. This can bias a trend analysis, and a trend may be inferred that is not related to changes in the environment. Sophisticated statistical techniques exist for imputation of missing or censored values, but these were beyond the scope of the present analysis.
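The substitution approach can be sketched as follows. This is a minimal illustration rather than the procedure H-GAC staff used: the function name `substitute_censored`, the string format of the results, and the sample values are all assumptions made for the example.

```python
def substitute_censored(result: str, fraction: float = 0.5) -> float:
    """Return a numeric value for a reported result.

    Results reported as "<X" are treated as non-detects and replaced by
    `fraction` times the reporting limit X (one-half by default).
    """
    text = result.strip()
    if text.startswith("<"):
        reporting_limit = float(text[1:])
        return fraction * reporting_limit
    return float(text)


# Hypothetical results with two different reporting limits (0.10 and 0.02).
raw_results = ["<0.10", "0.25", "<0.02", "0.08"]
substituted = [substitute_censored(r) for r in raw_results]
print(substituted)
```

Here the two non-detects become 0.05 and 0.01. Both samples were below detection, yet the substituted values differ by a factor of five purely because the reporting limit changed, which is how a spurious trend can arise.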

H-GAC staff performed several types of statistical analysis of the dataset, which can be classified as either descriptive (summaries of the dataset at the watershed or station level) or inferential (trend analysis). Additional information on basic statistical concepts, definitions of key statistical terms, and most of the tables and graphs discussed below may be found in the statistical appendix.

Basic descriptive statistics were assembled for each station (including historical stations) for each parameter. The summary table includes the range of sampling dates, number of samples, number of exceedances, percent exceedance, minimum value, maximum value, mean value, median value, standard deviation, r-square, t-ratio, p-value, 25th percentile (first quartile) and 75th percentile (third quartile). It should be noted that the r-square, t-ratio and p-values in this table may be biased by extreme values and cannot be considered independent indications of significant trends. The statistics in the summary tables apply to the dataset compiled for the Basin Summary Report analysis and not to all results that exist in SWQMIS.
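Several of the summary statistics listed above can be computed as in the sketch below. It rests on assumptions that the report does not state: an exceedance is counted when a value is greater than the criterion, quartiles use the "inclusive" interpolation method, and the values and the criterion are hypothetical. The regression-based columns (r-square, t-ratio, p-value) are not covered here.

```python
import statistics


def summarize(values, criterion):
    """Descriptive statistics for one station and one parameter."""
    # Quartile method is an assumption; other methods give slightly
    # different results for small samples.
    q25, _, q75 = statistics.quantiles(values, n=4, method="inclusive")
    # Assumes an exceedance means value > criterion.
    exceedances = sum(1 for v in values if v > criterion)
    return {
        "n": len(values),
        "exceedances": exceedances,
        "percent_exceedance": 100.0 * exceedances / len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std_dev": statistics.stdev(values),  # sample standard deviation
        "q25": q25,
        "q75": q75,
    }


# Hypothetical station record and criterion.
station_values = [10.0, 20.0, 30.0, 40.0, 500.0]
summary = summarize(station_values, criterion=35.0)
for name, value in summary.items():
    print(f"{name}: {value}")
```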

Box plots were created for each watershed and parameter showing the annual distribution of data in a graphical format, in addition to summary descriptive statistics for each year.

The percentage of samples exceeding either the 2010 water quality standards or the nutrient screening levels was graphed by year.

Seven linear regression analyses, listed below, were performed to identify statistically significant trends at the watershed and station levels. Regression statistics and graphs of trends may be found in the statistical appendix.

The median of each parameter was regressed on the year at the watershed level.

The median of each parameter was regressed on the year at the monitoring station level.

The natural logarithm of the value of each parameter was regressed on the collection date at the watershed level.

The natural logarithm of the value of each parameter was regressed on the collection date at the station level.

The median value of each parameter was regressed on the year for a restricted dataset consisting of stations containing at least 30 data points over 10 years for each parameter.

The natural logarithm of the value of each parameter was regressed on the collection date at the station level for a restricted dataset consisting of stations containing at least 30 data points over 10 years for each parameter.

The percent exceedance of the water quality standard or nutrient screening level was regressed on the year at the watershed level.
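The three kinds of regression described above (annual median on year, natural logarithm of the value on collection date, and percent exceedance on year), along with the restricted-dataset filter, can be sketched as follows. This is an illustration under stated assumptions, not the software or procedure used for the report: the helper names, the sample data, and the criterion are hypothetical, "at least 30 data points over 10 years" is read as at least 30 samples spanning at least 10 calendar years, and the significance statistics (t-ratio, p-value) are omitted.

```python
import math
import statistics
from collections import defaultdict
from datetime import date


def ols(x, y):
    """Least-squares fit of y on x; returns (slope, intercept, r_squared)."""
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    r_squared = sxy * sxy / (sxx * syy) if syy > 0 else 0.0
    return slope, y_bar - slope * x_bar, r_squared


def group_by_year(samples):
    """Map each calendar year to the list of values collected in it."""
    by_year = defaultdict(list)
    for collected, value in samples:
        by_year[collected.year].append(value)
    return dict(sorted(by_year.items()))


def meets_restriction(samples, min_points=30, min_years=10):
    """Assumed reading: >= 30 samples whose dates span >= 10 calendar years."""
    years = [collected.year for collected, _ in samples]
    return len(samples) >= min_points and max(years) - min(years) + 1 >= min_years


# Hypothetical record: three samples a year for 2000-2009, drifting upward.
samples = [
    (date(year, month, 15), factor * math.exp(0.1 * (year - 2000)))
    for year in range(2000, 2010)
    for month, factor in ((2, 0.5), (6, 1.0), (10, 2.0))
]
criterion = 1.5  # hypothetical standard or screening level

by_year = group_by_year(samples)
years = list(by_year)

# Annual median regressed on year.
median_fit = ols(years, [statistics.median(v) for v in by_year.values()])

# Natural logarithm of each value regressed on collection date (in days).
log_fit = ols([d.toordinal() for d, _ in samples],
              [math.log(v) for _, v in samples])

# Percent exceedance regressed on year (assumes exceedance means > criterion).
percent_exceedance = [
    100.0 * sum(v > criterion for v in values) / len(values)
    for values in by_year.values()
]
exceedance_fit = ols(years, percent_exceedance)

print("restricted dataset eligible:", meets_restriction(samples))
for label, (slope, _, r2) in [("median", median_fit), ("ln(value)", log_fit),
                              ("% exceedance", exceedance_fit)]:
    print(f"{label}: slope={slope:.5f} r_squared={r2:.3f}")
```

Note that the slope of the log regression is expressed per day because collection dates are converted to day counts, whereas the other two slopes are per year.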