Wednesday, October 4, 2017

Assignment 2
Goals of Assignment 2
Increase Familiarity with Definitions of Descriptive Statistics
Increase Familiarity with Statistical Methods
Increase Familiarity with Computer Programs

Part 1
Methods: Range, mean, median, mode, kurtosis, skewness, and standard deviation were defined by the author, in his own words. Data supplied by Dr. Ryan Weichelt was used to hand calculate a standard deviation for two sets of data: the first was calculated using a sample of students' standardized test scores from Eau Claire North High School and the second using a sample of students' standardized test scores from Eau Claire Memorial. Microsoft Excel was then used to calculate the range, mean, median, mode, kurtosis, skewness, and standard deviation for the Eau Claire North and Eau Claire Memorial data samples. These results were then used to answer the following question: "Should Eau Claire North teachers worry about not having the highest test grade?"

Definitions

Range: The range is the difference between the highest and lowest scores; max score - low score.

Mean: The mean is the calculated average of all observations, found by adding all observations together then dividing by the total number of observations.

Median: The median is the midpoint of the data set; half of the observations fall above and half fall below this point. If the total number of observations is odd, it is simply the middle number in the sorted data set; if the total number of observations is even, it is the average of the two middle observations.

Mode: The mode is the most frequently occurring data point found in a set of observations.
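
The four definitions above can be checked with a few lines of Python. This is only a minimal sketch using the standard library's statistics module; the scores in it are made up for illustration and are not the Eau Claire data.

```python
import statistics

def describe(scores):
    """Return the range, mean, median, and mode of a list of numbers."""
    return {
        "range": max(scores) - min(scores),    # highest score minus lowest score
        "mean": statistics.mean(scores),        # sum of observations / number of observations
        "median": statistics.median(scores),    # middle value, or average of the two middle values
        "mode": statistics.mode(scores),        # most frequently occurring value
    }

# Hypothetical scores, not taken from the assignment data.
example_scores = [150, 160, 160, 170, 180, 190]
summary = describe(example_scores)
print(summary)
```

Because this example list has an even number of observations, the median is the average of the two middle values (160 and 170).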

Kurtosis: Kurtosis (Fig. 1) describes the shape of the distribution curve, which can be mesokurtic (normal), leptokurtic (peaked), or platykurtic (flat). A value greater than 1 is considered leptokurtic while a value less than -1 is considered platykurtic. Curves that are lepto- or platykurtic have this shape because the 68% of observations that fall within one standard deviation above and below the mean have to fit in that area (Fig. 2). Leptokurtic distributions have smaller standard deviations, as there are more observations near the mean, while platykurtic distributions have larger standard deviations.


Figure 1: Kurtosis
(http://grants.hhp.coe.uh.edu/doconnor/PEP6305/KurtosisPict.jpg)

Figure 2: Normal Distribution
(http://img.tfd.com/dorland/distribution_normal.jpg)


Skewness: Skewness (Fig. 3) is the measure of the asymmetry of a distribution curve caused by a higher- or lower-than-expected number of observations falling in the positive or negative end of the curve. If positive, there is an extended tail on the positive side (most observations fall at lower-than-expected values); if negative, there is an extended tail on the negative side (most observations fall at higher-than-expected values). A value between -1 and 1 is considered normal.




Figure 3: Skewness
(https://www.isobudgets.com/wp-content/uploads/2015/10/skewness.jpg)
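
For readers who want to see how skewness and kurtosis values like the ones in this assignment can be produced, here is a rough sketch. It assumes the bias-adjusted sample formulas documented for Excel's SKEW and KURT functions; the post itself does not say which formulas Excel applied, so treat that as my assumption. The function names are my own.

```python
import math

def sample_std(values):
    """Sample standard deviation (n - 1 in the denominator)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))

def sample_skewness(values):
    """Bias-adjusted sample skewness (assumed to match Excel's SKEW)."""
    n = len(values)
    mean = sum(values) / n
    s = sample_std(values)
    cubed = sum(((x - mean) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubed

def sample_excess_kurtosis(values):
    """Bias-adjusted sample excess kurtosis (assumed to match Excel's KURT)."""
    n = len(values)
    mean = sum(values) / n
    s = sample_std(values)
    fourth = sum(((x - mean) / s) ** 4 for x in values)
    lead = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return lead * fourth - correction

# Hypothetical data: one large value stretches the tail to the positive side.
right_tailed = [1, 2, 3, 4, 10]
print(sample_skewness(right_tailed))            # positive value
print(sample_excess_kurtosis([1, 2, 3, 4, 5]))  # negative value (flatter than normal)
```

A symmetric data set gives a skewness of zero, and evenly spread values give a negative kurtosis, which fits the platykurtic description above.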

Standard Deviation: The standard deviation (Fig. 4) measures the spread of scores about the mean. It is found by subtracting the mean from each observation, squaring each result, summing the squares, dividing the sum by N (population) or n-1 (sample), and then taking the square root. In this assignment n-1 was used because the data are a sample, making this a sample standard deviation.


Figure 4: Sample and Population Standard Deviation Equations
(http://dsearls.org/courses/M120Concepts/ClassNotes/Statistics/StandardDeviation2a.gif)
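
The hand calculation follows the same steps as the equations in Figure 4. Below is a small sketch of those steps in Python; the numbers are a made-up example rather than the test score data, and the function name is my own.

```python
import math

def standard_deviation(values, sample=True):
    """Standard deviation computed step by step, as in a hand calculation."""
    n = len(values)
    mean = sum(values) / n                                   # step 1: find the mean
    squared_deviations = [(x - mean) ** 2 for x in values]   # step 2: subtract the mean, square
    divisor = n - 1 if sample else n                         # step 3: n - 1 for a sample, N for a population
    return math.sqrt(sum(squared_deviations) / divisor)      # step 4: divide the sum, take the square root

# Hypothetical values chosen so the arithmetic is easy to follow (mean = 5).
values = [2, 4, 4, 4, 5, 5, 7, 9]
population_sd = standard_deviation(values, sample=False)
sample_sd = standard_deviation(values, sample=True)
print(population_sd, sample_sd)
```

Dividing by n-1 instead of N always gives a slightly larger result, which is the adjustment made when working with a sample.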

Hand Calculation


Eau Claire North Test Score Data Hand Calculation



Figure 5: Eau Claire North Data

Eau Claire Memorial Test Score Data Hand Calculation





Figure 6: Eau Claire Memorial Data

Results

Eau Claire North Test Scores


Eau Claire Memorial Test Scores



This data suggests that teachers at Eau Claire North have no reason to worry about being fired because their students have lower test scores than students at Eau Claire Memorial. The four descriptive statistics that are best for determining this are the range, median, mean, and kurtosis. I chose these for the simple fact that they portray score differences in a manner that is easy to explain to people with little knowledge of descriptive statistics. EC North students have a smaller range (83 compared to 91), which shows us that their test scores are less widely dispersed than the EC Memorial students' scores; i.e., the gap between the lowest (111) and highest (194) scores at North is smaller than the gap between the lowest (107) and highest (198) scores at Memorial. This may suggest that the North students taking the test were better prepared at all levels of ability than the students at Memorial. The median was higher at North (164.5) than at Memorial (159.5). This tells us that half of the sampled students at North scored above 164.5 and half scored below it, compared to Memorial's median of 159.5. The mean test scores in this sample show that North, with a 160.92 average, had a higher mean test score than Memorial, at 158.54. Kurtosis refers to the shape of the distribution; the kurtosis of the North sample (-0.56) is between -1 and 1, so its distribution curve is considered normal. The kurtosis of the Memorial sample (-1.17) is less than -1, so its curve is considered platykurtic. This flatness of the curve is due to a larger standard deviation (27.16) and reflects the wide dispersion of scores about the mean; there are fewer scores near the Memorial mean (158.54) than near the North mean (160.92), i.e., the scores at North were closer to the mean.
The students at North, on average, performed better on this standardized test. These statistics show that the teachers and students at North are doing quite well and public perceptions are not only unjustified but probably incorrect. This analysis was done using a sample of test scores from both schools using descriptive statistics; we are describing what we see and can make assumptions about this data but we cannot make any inferences regarding it. 
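
The kurtosis interpretation and the ranges above can be double-checked with a short sketch. It uses only the values reported in this post and the -1 and 1 cutoffs from my definition in Part 1; the function name is my own.

```python
def classify_kurtosis(value):
    """Label a kurtosis value using the -1 / 1 cutoffs from the definition in Part 1."""
    if value > 1:
        return "leptokurtic"
    if value < -1:
        return "platykurtic"
    return "mesokurtic"

# Values reported in the results above.
reported_kurtosis = {"North": -0.56, "Memorial": -1.17}
score_ranges = {"North": 194 - 111, "Memorial": 198 - 107}  # highest minus lowest

for school, kurtosis in reported_kurtosis.items():
    print(school, classify_kurtosis(kurtosis), score_ranges[school])
```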

Part 2
Methods: The shapefile used in Assignment 1 was added to a blank map in ArcGIS, and a Microsoft Excel spreadsheet containing Wisconsin population data by county, provided by Dr. Ryan Weichelt, was joined to the shapefile based on the county Geo_ID. Three spatial statistic analyses were performed: a geographic mean center of Wisconsin (Toolbox/Spatial Statistics/Measuring Geographic Distributions/Mean Center), a weighted mean center of population using 2000 Wisconsin county population data (Toolbox/Spatial Statistics/Measuring Geographic Distributions/Mean Center/Weight/2000), and finally a weighted mean center of population using 2015 Wisconsin county population data (Toolbox/Spatial Statistics/Measuring Geographic Distributions/Mean Center/Weight/2015).


Mean Center: The mean center is a spatial measure of central tendency; in the map below it is the red point, which marks the geographic center of the state of Wisconsin, found by averaging the X and Y values (longitude and latitude).
Weighted Mean Center: Weighted mean centers are concerned with the frequencies in data sets; in the map below mean centers were weighted by population. More populous counties carry a heavier weight, which pulls the weighted mean center away from the geographic mean center and toward those counties.
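
To make the difference between the two measures concrete, here is a small sketch of both calculations. The four points and their populations are invented for illustration, and I am assuming each county is represented by a single centroid point; this is not the ArcGIS tool itself.

```python
def mean_center(points):
    """Unweighted mean center: the average X and average Y of all points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    return (mean_x, mean_y)

def weighted_mean_center(points, weights):
    """Weighted mean center: each coordinate counts in proportion to its weight."""
    total = sum(weights)
    mean_x = sum(w * x for (x, _), w in zip(points, weights)) / total
    mean_y = sum(w * y for (_, y), w in zip(points, weights)) / total
    return (mean_x, mean_y)

# Hypothetical centroids (X, Y) and populations, not Wisconsin data.
centroids = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
populations = [100, 100, 700, 100]

print(mean_center(centroids))                        # center of the four points
print(weighted_mean_center(centroids, populations))  # pulled toward the populous point
```

In this toy example the heavily populated point at (10, 10) drags the weighted center away from the middle of the square, the same effect that large cities have on the weighted mean centers in the map.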


Results: The green point represents the weighted mean center of population in 2000, and the blue point that of 2015. Both points are pulled to the southeast of the geographic center of Wisconsin, which is the red point, as the southern and eastern regions of Wisconsin have a higher population than the northern and western parts of the state. This is due to high populations in cities such as Milwaukee (southeast corner of the state), Madison (south-central), and Green Bay (east-central), for example. We observe, however, that the weighted mean center of population in 2015 has been pulled slightly to the southwest of the weighted mean center of population observed in 2000. An explanation for this is a slight increase in county populations in the southern and western parts of the state that is greater than the increase in the northern and eastern counties.

Sunday, September 24, 2017

Assignment 1
Goals of Assignment 1
Differentiate Between Levels of Measurement
Differentiate Between Classification Methods
Retrieving Data for the US Census and Joining Data
Enhance Cartographic Knowledge

Part I

Nominal Data: Nominal data is data that is categorized (into one of two or more categories) by membership label and has no inherent value associated with it; the unit assignment is categorical only. In Figure 1, tree climate zones are portrayed; the climate zones have no quantity associated with them and are simply regional labels.



Figure 1: Nominal Data
 (printable-maps.blogspot.com)

Ordinal Data: Ordinal data is data that is placed in a rank order, such as what is found in Likert scales (for example, a 1-10 scale running from very dissatisfied to highly satisfied) or the Mohs scale of hardness, which orders the scratch resistance of minerals from 1 (least) to 10 (most). The ordering of the data is what matters here; differences between values are not important or even knowable. This idea is exemplified in Figure 2, which I have chosen to use here. The differences in happiness levels are not known but the rank order of the data is.



Figure 2: Ordinal Data
(http://www.huffingtonpost.com/2013/08/02/happiest-states-_n_3696160.html)

Interval Data: Interval data is data that has a known order and known differences between the data points but has no true zero origin, making it impossible to calculate meaningful ratios. We know exact differences between the data points; for example, 71 minus 64 is 7 degrees. Zero on the Fahrenheit scale does not mean an absence of temperature; it is simply a reference point. Figure 3 is a map of temperatures across the United States, which is a common way to portray interval data.



Figure 3: Interval Data
(https://weather.com/maps/ustemperaturemap)

Ratio Data: Ratio data is data that has an order, measurable differences between data points, and a true zero origin, which allows ratios and true quantities to be interpreted. Figure 4 represents the temperature across the Earth's surface in kelvins. The Kelvin scale has a true zero point, which is the point at which all molecular motion stops.



Figure 4: Ratio Data
(climate.nasa.gov)
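
The difference between interval and ratio data shows up quickly in a bit of arithmetic. The sketch below uses the standard Fahrenheit-to-Kelvin conversion; the 80 and 40 degree temperatures are simply examples I picked.

```python
def fahrenheit_to_kelvin(deg_f):
    """Convert a Fahrenheit temperature to kelvins."""
    return (deg_f - 32) * 5 / 9 + 273.15

# Differences are meaningful on an interval scale.
difference_f = 71 - 64  # 7 degrees Fahrenheit

# Ratios are not: 80 F is not "twice as hot" as 40 F, because 0 F is an arbitrary reference point.
naive_ratio = 80 / 40

# On the Kelvin scale, which has a true zero, the ratio is meaningful.
kelvin_ratio = fahrenheit_to_kelvin(80) / fahrenheit_to_kelvin(40)

print(difference_f, naive_ratio, round(kelvin_ratio, 3))
```

The Kelvin ratio comes out only slightly above 1, which shows how misleading the Fahrenheit ratio of 2 would be.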

Part II

Methods: An Excel sheet with all Wisconsin county Geo_IDs was provided by Dr. Ryan Weichelt, and the number of certified organic farms in each county was then entered using information located at https://www.agcensus.usda.gov/Publications/2012/Full_Report/Volume_1,_Chapter_2_County_Level/Wisconsin/st55_2_042_042.pdf. This contained the relevant data (number of certified organic farms) gathered by the 2012 Census of Agriculture. A Wisconsin counties shapefile was then downloaded from the United States Census Bureau, located at
http://factfinder2.census.gov/faces/nav/jsf/pages/index.xhtml, after performing the following selections from the home page: advanced search, data selection (2010 SF1 100% Data), geographies, county as geographic type, Wisconsin as state, then shapefile download in the map mode. The shapefile was then connected to ArcGIS and added to a blank map. The Excel table was then joined to the shapefile in ArcGIS based on the Geo_ID codes. Three maps were made using three different data classification methods (explained in detail below) presenting identical data; each classification method utilized four classes.
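
The join step works like a lookup: each county in the shapefile picks up the farm count from the table row that has the same Geo_ID. Here is a minimal sketch of that idea in plain Python. The Geo_ID values "A", "B", and "C" are placeholders I invented, and the field name organic_farms is hypothetical; only the three farm counts come from this post.

```python
def join_by_key(features, table, key="Geo_ID"):
    """Attach table fields to each feature whose key value matches a table row."""
    lookup = {row[key]: row for row in table}   # index the table by its key
    joined = []
    for feature in features:
        merged = dict(feature)                  # copy so the input is not modified
        match = lookup.get(feature[key], {})    # unmatched features keep only their own fields
        for field, value in match.items():
            if field != key:
                merged[field] = value
        joined.append(merged)
    return joined

# Placeholder Geo_IDs; the farm counts are the ones mentioned in this post.
county_features = [
    {"Geo_ID": "A", "name": "Douglas"},
    {"Geo_ID": "B", "name": "Clark"},
    {"Geo_ID": "C", "name": "Vernon"},
]
farm_table = [
    {"Geo_ID": "C", "organic_farms": 233},
    {"Geo_ID": "A", "organic_farms": 5},
    {"Geo_ID": "B", "organic_farms": 49},
]

joined_counties = join_by_key(county_features, farm_table)
print(joined_counties)
```

The order of rows in the table does not matter, since matching is done on the key rather than on position.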


Equal Interval Classification Method: Figure 1 represents the data displayed on a map using the equal interval classification method. The equal interval classification method divides the range (maximum observation minus minimum observation) of observed values into a predetermined number of classes of equal size; the problem in this case is that 70 counties fall into the first class (0.0-58.3), with only one each in the second (58.4-116.5) and fourth (174.9-233) classes and none in the third (116.6-174.8) class. Using the equal interval classification method is not an effective way to portray this data; you cannot propose concentrating on any of the 70 counties with any confidence when a single class holds values as disparate as those in the following example. Douglas County has 5 organic farms while Clark County has 49 organic farms. Both of these counties appear in the first class, which is where efforts should be concentrated, but Douglas County seems to be the better choice over Clark County as it has far fewer organic farms. That is impossible to determine with the data mapped in this manner.


Figure 1: Mapped data using the equal interval classification method.
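
The class breaks in Figure 1 can be reproduced with simple arithmetic. The sketch below assumes a minimum of 0 and a maximum of 233 (the Vernon County value), which matches the class ranges listed above; the function names are my own.

```python
def equal_interval_breaks(minimum, maximum, classes):
    """Upper bounds of each class when the range is split into equal-width classes."""
    width = (maximum - minimum) / classes
    return [minimum + width * i for i in range(1, classes + 1)]

def classify(value, breaks):
    """Return the 1-based class number whose upper bound first contains the value."""
    for class_number, upper_bound in enumerate(breaks, start=1):
        if value <= upper_bound:
            return class_number
    raise ValueError("value is above the highest class break")

breaks = equal_interval_breaks(0, 233, 4)
print(breaks)                 # class width is 233 / 4 = 58.25

print(classify(5, breaks))    # Douglas County
print(classify(49, breaks))   # Clark County
print(classify(233, breaks))  # Vernon County
```

Douglas County and Clark County land in the same class even though their farm counts are far apart, which is the weakness described above.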


Natural Breaks Classification Method: Figure 2 represents the data displayed on a map using the Natural Breaks classification method. The Natural Breaks method seeks the minimization of variance within classes and the maximization of variance between classes and assigns the data accordingly; i.e., it classifies data that are closest in values into four classes (in this case) based on breaks (gaps) in the data. This can lead to classes that contain widely varying number ranges. This classification resulted in a smaller range of values in the first three classes but still does not present the data in a manner that will allow for a good business decision to be made. Maximum effort should be made in those counties that have the fewest certified organic farms, but a majority of the counties in Wisconsin still fall into the first class and the Natural Breaks method isn't sensitive enough to portray differences at the lower end of the data range. There is an interesting spatial relationship beginning to become apparent here; more on this in the next method description.



Figure 2: Mapped data using the Natural Breaks classification method.
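
The idea of minimizing variance within classes can be shown on a small list. This sketch simply tries every possible way of cutting the sorted values into classes and keeps the grouping with the smallest total within-class squared deviation. It is a brute-force illustration only, not the optimized routine a GIS package uses, and the values are invented apart from the 233 outlier.

```python
from itertools import combinations

def within_class_ssd(group):
    """Sum of squared deviations of a group's values from the group mean."""
    mean = sum(group) / len(group)
    return sum((x - mean) ** 2 for x in group)

def natural_breaks(values, classes):
    """Exhaustively find the grouping of sorted values with the least within-class variation."""
    data = sorted(values)
    best_groups, best_cost = None, float("inf")
    # Each combination lists the positions where a new class starts.
    for cuts in combinations(range(1, len(data)), classes - 1):
        bounds = (0,) + cuts + (len(data),)
        groups = [data[start:end] for start, end in zip(bounds, bounds[1:])]
        cost = sum(within_class_ssd(group) for group in groups)
        if cost < best_cost:
            best_groups, best_cost = groups, cost
    return best_groups

# Hypothetical farm counts with one large outlier.
farm_counts = [12, 1, 50, 3, 233, 10, 2, 51, 11]
groups = natural_breaks(farm_counts, 4)
print(groups)
```

The outlier ends up in a class of its own, which uses up one of the four classes and leaves fewer classes to separate the low values.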


Quantile Classification Method: Figure 3 represents the data displayed on a map using the quantile classification method. In the quantile classification method an approximately equal number of features is placed in each group, independent of the quantities of farms. In this case, the 72 counties (features) of Wisconsin were divided by four (the number of classes used for this project), which targets 18 counties per class (tied values can make the actual class counts differ slightly). This method is much more sensitive to differences at the lower end of the values than the other two classification methods used previously, making it extremely valuable in making an informed decision as to where efforts to promote the startup of organic farms should be made. The first two data groups are composed of the 32 counties where there are fewer than 10 organic farms; these are ideal areas to promote the message of increasing organic farms. The spatial pattern that began in Figure 1, and became readily obvious by Figure 3, is that the southwestern, southern, and central regions of Wisconsin have many more organic farms than the surrounding areas, generally speaking. Efforts should be focused on the areas surrounding southwestern Wisconsin.



Figure 3: Mapped data using the quantile classification method.
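
The quantile method is the easiest of the three to sketch, since it only depends on the sorted order of the values. The example below splits 72 stand-in values into four classes; the values are just the numbers 0 through 71, not the actual farm counts, and ties are ignored.

```python
def quantile_classes(values, classes):
    """Split the sorted values into classes holding (nearly) equal numbers of features."""
    data = sorted(values)
    n = len(data)
    # Positions in the sorted list where each class begins and ends.
    bounds = [round(i * n / classes) for i in range(classes + 1)]
    return [data[start:end] for start, end in zip(bounds, bounds[1:])]

# 72 stand-in values, one per Wisconsin county.
stand_in_values = list(range(72))
quantile_groups = quantile_classes(stand_in_values, 4)
print([len(group) for group in quantile_groups])
```

With 72 features and four classes, each class receives 72 / 4 = 18 features when there are no tied values at the class boundaries.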

Results: I have already determined, based on the explanations preceding each figure, that the quantile classification method works best for this project. Both the equal interval and Natural Breaks methods were sensitive to the Vernon County outlier (value of 233), which affected the ability of these classification methods to portray the data in an easily discernible manner. This is, however, an initial step, and further studies would be needed on factors such as demand for organic foods, population densities, soil quality, aquifer stability, and the available labor (organic farming being labor-intensive) in each county. Although the same data was used for each map, the maps varied widely, and the differences were entirely dependent upon the data classification method used. Using a larger number of classes might have changed which method offered the best presentation of the data.