
Describe What Can Make a Graph


Academic year: 2023


(1)

Dr. Asiya Chaudhary Asst. Professor (ss), D/O Commerce AMU, Aligarh

(2)

Descriptive statistics are a set of brief descriptive coefficients that summarize a given data set, which can represent either the entire population or a sample. They are used:

First, to get a feel for the data;

Second, in the statistical tests themselves; and

Third, to indicate the error associated with results and graphical output.

They do not involve generalizing beyond the data at hand. Generalizing from our data to another set of cases is the business of inferential statistics. 

(3)
(4)
(5)

MEASURES OF CENTRAL TENDENCY

These are ways of describing the central position of a frequency distribution for a group of data.


Measures of central tendency include the mean, median and mode.

MEASURES OF SPREAD

These are ways of summarizing a group of data by describing how spread out the scores are.

Measures of variability include the standard deviation (or variance), the minimum and maximum values, kurtosis and skewness.

(6)
(7)

Frequency distribution

A frequency distribution lists each category of data and the number of occurrences of each category.

(8)
(9)
(10)
(11)

We are given the mathematics achievement test scores for a sample of 50 sixth-grade students at a School:

75, 48, 46, 65, 71, 49, 61, 51, 57, 49, 84, 85, 79, 85, 83, 55, 69, 88, 89, 55, 61, 72, 64, 67, 60, 77, 51, 61, 68, 54, 63, 98, 54, 53, 71, 84, 79, 75, 65, 50, 41, 65, 77, 71, 63, 67, 57, 63, 71, 77.

(12)

Test scores   Frequency   Cumulative frequency   Relative frequency   Relative frequency (%)   Cumulative relative frequency

40-50              5               5                   .10                  10%                      10%

50-60             10              15                   .20                  20%                      30%

60-70             15              30                   .30                  30%                      60%

70-80             12              42                   .24                  24%                      84%

80-90              7              49                   .14                  14%                      98%

90-100             1              50                   .02                   2%                     100%

Total             50              50                  1.00                 100%                     100%
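The table can be rebuilt from the raw scores with a short script. This is only a sketch: it assumes each class includes its lower limit and excludes its upper limit (so a score of 50 is counted in 50-60), which the slides do not state but which matches the frequencies shown. The function name `frequency_table` is made up for illustration.

```python
# Mathematics achievement test scores for the 50 sixth-grade students.
scores = [
    75, 48, 46, 65, 71, 49, 61, 51, 57, 49, 84, 85, 79, 85, 83, 55, 69,
    88, 89, 55, 61, 72, 64, 67, 60, 77, 51, 61, 68, 54, 63, 98, 54, 53,
    71, 84, 79, 75, 65, 50, 41, 65, 77, 71, 63, 67, 57, 63, 71, 77,
]

# Class boundaries 40, 50, ..., 100 (assumed: lower limit included, upper excluded).
edges = list(range(40, 101, 10))


def frequency_table(data, edges):
    """Return rows of (class, frequency, cumulative, relative, cumulative relative)."""
    n = len(data)
    rows = []
    cumulative = 0
    for lower, upper in zip(edges, edges[1:]):
        count = sum(lower <= x < upper for x in data)
        cumulative += count
        rows.append((f"{lower}-{upper}", count, cumulative, count / n, cumulative / n))
    return rows


table = frequency_table(scores, edges)
for label, freq, cum, rel, cum_rel in table:
    print(f"{label:>7} {freq:>4} {cum:>4} {rel:>6.2f} {cum_rel:>6.0%}")
```

The relative frequency is simply the class frequency divided by the total number of observations, and each cumulative column is a running total of the column before it.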

(13)

A bar graph is constructed by labeling each category of data on either the horizontal or vertical axis and the frequency or relative frequency of the category on the other axis.

Rectangles of equal width are drawn for each category. The height of each rectangle represents the category’s frequency or relative frequency.

(14)
(15)

A two-directional bar chart indicates both positive and negative values. The following example gives the five cities with the highest and lowest recorded temperatures:

(16)
(17)

A Pareto chart is a bar graph whose bars are drawn in decreasing order of frequency or relative frequency.

(18)
(19)
(20)
(21)
(22)

A component bar chart subdivides the bars in different sections. It is useful when the total of the components is of interest. The following example gives the nutritive values of food.

(23)
(24)

A pie chart is a circle divided into sectors.

Each sector represents a category of data. The area of each sector is proportional to the frequency of the category.
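Because every sector shares the same radius, making the area proportional to the frequency amounts to making the central angle proportional to it. A minimal sketch, reusing the frequencies of the test-score classes purely as illustrative categories (the helper `sector_angles` is a made-up name):

```python
# Frequencies borrowed from the test-score table, used here only as example categories.
frequencies = {
    "40-50": 5, "50-60": 10, "60-70": 15,
    "70-80": 12, "80-90": 7, "90-100": 1,
}


def sector_angles(freqs):
    """Central angle in degrees for each category: 360 x relative frequency."""
    total = sum(freqs.values())
    return {category: 360 * count / total for category, count in freqs.items()}


angles = sector_angles(frequencies)
for category, angle in angles.items():
    print(f"{category:>7}: {angle:6.1f} degrees")
```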

(25)
(26)
(27)
(28)
(29)
(30)
(31)

A stem-and-leaf plot is another way to represent quantitative data graphically. In a stem-and-leaf plot (sometimes called simply a stem plot), we use the digits to the left of the rightmost digit to form the stem. Each rightmost digit forms a leaf. For example, a data value of 147 would have 14 as the stem and 7 as the leaf.

(32)

Step 1: The stem of a data value consists of the digits to the left of the rightmost digit. The leaf of a data value is the rightmost digit.

Step 2: Write the stems in a vertical column in increasing order. Draw a vertical line to the right of the stems.

Step 3: Write each leaf corresponding to the stems to the right of the vertical line.

Step 4: Within each stem, rearrange the leaves in ascending order, title the plot, and provide a legend to indicate what the values represent.

(33)

Example

37, 33, 33, 32, 29, 28, 28, 23, 22, 22, 22, 21, 21, 21, 20, 20, 19, 19, 18, 18, 18, 18, 16, 15, 14, 14, 14, 12, 12, 9, 6

3 | 2337

2 | 001112223889

1 | 2244456888899

0 | 69
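The display for this example can be reproduced programmatically. The sketch below assumes non-negative whole numbers and takes the last digit as the leaf; the function name `stem_and_leaf` is illustrative only.

```python
from collections import defaultdict

data = [37, 33, 33, 32, 29, 28, 28, 23, 22, 22, 22, 21, 21, 21, 20, 20,
        19, 19, 18, 18, 18, 18, 16, 15, 14, 14, 14, 12, 12, 9, 6]


def stem_and_leaf(values):
    """Map each stem to its leaves in ascending order (leaf = rightmost digit)."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        plot[stem].append(leaf)
    return dict(plot)


plot = stem_and_leaf(data)
# Print the largest stem first, matching the layout of the example.
for stem in sorted(plot, reverse=True):
    print(f"{stem} | {''.join(str(leaf) for leaf in plot[stem])}")
```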

We can make our figure even more revealing by splitting each stem into two parts as follows:

3 | 7

3 | 233

2 | 889

2 | 001112223

1 | 56888899

1 | 22444

0 | 69

(34)

Back-to-back stem and leaf graph

        11 | 4 |

           | 3 | 7

       332 | 3 | 233

      8865 | 2 | 889

  44331110 | 2 | 001112223

 987776665 | 1 | 56888899

       321 | 1 | 22444

         7 | 0 | 69

(35)

A dot plot is drawn by placing each observation horizontally in increasing order and placing a dot above the observation each time it is observed.

Though limited in usefulness, dot plots can be used to quickly visualize the data.

(36)

Box plots reflect the scores on a quantitative variable, categorized by a qualitative variable; they are based on the median, quartiles, and extreme values, or they may reflect simply one quantitative variable.

The box represents the interquartile range, which contains the middle 50% of values. It stretches from the 25th percentile to the 75th percentile.

The whiskers are lines that extend from the box to the highest and lowest values, excluding outliers.

The dark line across the box indicates the median (middle score).

The marks above or below the whiskers are indicators of outliers (scores that are considered to be outside the normal range).
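The quantities behind a box plot can be computed directly. The sketch below makes two assumptions the slides do not spell out: quartiles are taken with Python's "inclusive" method (other conventions give slightly different values), and an outlier is any value more than 1.5 x IQR beyond the box, which is a common convention rather than the only one. The data are the 31 values from the stem-and-leaf example.

```python
import statistics

data = [37, 33, 33, 32, 29, 28, 28, 23, 22, 22, 22, 21, 21, 21, 20, 20,
        19, 19, 18, 18, 18, 18, 16, 15, 14, 14, 14, 12, 12, 9, 6]

# Quartiles: 25th percentile, median, 75th percentile.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Assumed outlier rule: beyond 1.5 x IQR from the edges of the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

# Whiskers reach the most extreme values that are not outliers.
inside = [x for x in data if lower_fence <= x <= upper_fence]
whiskers = (min(inside), max(inside))

print(f"box: {q1} to {q3}, median {median}")
print(f"whiskers: {whiskers}, outliers: {outliers}")
```

Under these assumptions the box runs from 15.5 to 22.5 with the median at 20, the whiskers reach 6 and 33, and 37 is marked as an outlier.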

(37)
(38)
(39)
(40)
(41)
(42)

Describe What Can Make a Graph Misleading or Deceptive

(43)

1. Title

2. Labels on both axes of a line or bar chart and on all sections of a pie chart

3. Source of the data

4. Key to a pictograph

5. Uniform size of a symbol in a pictograph

6. Scale: Does it start with zero? If not, is there a break shown?

7. Scale: Are the numbers equally spaced?

(44)
(45)

Wrong Base Line

Another distortion in bar charts results from setting the baseline to a value other than zero. The baseline is the bottom of the Y-axis, representing the least number of cases that could have occurred in a category.

Normally, but not always, this number should be zero.


(46)
(47)

Wrong Line Graphs

It is a serious mistake to use a line graph when the X-axis contains merely qualitative variables. A line graph is essentially a bar graph with the tops of the bars represented by points joined by lines (the rest of the bar is suppressed). The following figure inappropriately shows a line graph of the card game data from Yahoo. The drawback of Figure 7 is that it gives the false impression that the games are naturally ordered in a numerical way when, in fact, they are ordered alphabetically.


(48)


(49)
(50)
(51)

Misleading Graphs

The viewer's attention is captured by areas, and areas can exaggerate the size differences between the groups. In terms of percentages, the ratio of soccer participation in 1991 to that in 2006 is 10:14, or 1:1.4. But the ratio of the two areas in the given figure is about 1:25. A biased person wishing to hide the facts may use such a depiction.

Edward Tufte coined the term "lie factor" to refer to the ratio of the size of the effect shown in a graph to the size of the effect shown in the data. He suggests that lie factors greater than 1.05 or less than 0.95 produce unacceptable distortion.
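The soccer example can be turned into a lie factor. The sketch below assumes that the "size of the effect" is measured as a relative change, (second value - first value) / first value, which is how Tufte's definition is usually applied; the slides themselves do not state the formula, and the function names are illustrative.

```python
def effect_size(first, second):
    """Relative change from the first value to the second (assumed definition)."""
    return abs(second - first) / abs(first)


def lie_factor(graphic_first, graphic_second, data_first, data_second):
    """Effect shown in the graphic divided by the effect present in the data."""
    return effect_size(graphic_first, graphic_second) / effect_size(data_first, data_second)


# Data: participation ratio 10:14. Graphic: areas in a ratio of about 1:25.
lf = lie_factor(1, 25, 10, 14)
print(f"lie factor = {lf:.1f}")
print("acceptable" if 0.95 <= lf <= 1.05 else "unacceptable distortion")
```

With these inputs the data show a 40% increase while the areas show a 2400% increase, giving a lie factor of about 60. Even a cruder comparison of the two ratios (25 / 1.4, about 18) falls far outside the 0.95 to 1.05 band.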


(52)
(53)

3D Charts

Don’t get fancy! People sometimes add features to graphs that don’t help to convey their information.

For example, 3-dimensional bar charts such as the one shown in the next figure are usually not as effective as their two-dimensional counterparts.


(54)
(55)
(56)
(57)
(58)

Year               1999   2000   2001   2002   2003

No. of Graduates    140    180    200    210    160

(59)


One may claim that on the basis of this information, we can conclude that men are worse drivers than women. Discuss whether you can reach that conclusion from the pictograph or you need more information. If more information is needed, what would you like to know?

(60)


Which graph could be used to indicate a greater decrease in the price of gasoline? Explain.

(61)

Are Commercial Vehicles in Texas Unsafe?

Prerequisite: Graphing Distributions

A news report on the safety of commercial vehicles in Texas stated that one out of five commercial vehicles was pulled off the road in 2012 because it was unsafe. In addition, 12,301 commercial drivers were banned from the road for safety violations.

The author presents the bar chart below to provide information about the percentage of fatal crashes involving commercial vehicles in Texas since 2006. The author also quotes:

Commercial vehicles are responsible for approximately 15 percent of the fatalities in Texas crashes. Those who choose to drive unsafe commercial vehicles or drive a commercial vehicle unsafely pose a serious threat to the motoring public.


(62)


(63)

What do you think?

Based on what you have learned in this session, does this bar chart provide enough information to conclude that unsafe or unsafely driven commercial vehicles pose a serious threat to the motoring public? What might you conclude if 30 percent of all the vehicles on the roads of Texas in 2010 were commercial and accounted for 16 percent of fatal crashes?


(64)

This bar chart does not provide enough information to draw such a conclusion because we don’t know, on the average, in a given year what percentage of all vehicles on the road are commercial vehicles. For example, if 30 percent of all the vehicles on the roads of Texas in 2010 are commercial ones and only 16 percent of fatal crashes involved commercial vehicles, then commercial vehicles are safer than non-commercial ones. Note that in this case 70 percent of vehicles are non-commercial and they are responsible for 84 percent of the fatal crashes.
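The comparison in this answer can be made explicit by dividing each group's share of fatal crashes by its share of vehicles on the road. The 30 percent figure is the hypothetical value from the question, not real data, and the function name is made up for illustration.

```python
def relative_involvement(share_of_crashes, share_of_vehicles):
    """Share of fatal crashes divided by share of vehicles; 1.0 means proportionate."""
    return share_of_crashes / share_of_vehicles


# Hypothetical 2010 scenario from the question above.
commercial = relative_involvement(0.16, 0.30)
non_commercial = relative_involvement(0.84, 0.70)

print(f"commercial:     {commercial:.2f}")
print(f"non-commercial: {non_commercial:.2f}")
```

A value below 1 means a group is involved in fewer fatal crashes than its presence on the road would suggest. In this scenario commercial vehicles come out at about 0.53 and non-commercial vehicles at 1.2, which is why the bar chart alone cannot support the quoted claim.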


(65)

Raw data are first organized into tables. Data are organized by creating classes into which they fall. Qualitative data and discrete data have values that provide clear-cut categories of data. However, with continuous data the categories, called classes, must be created. Typically, the first table created is a frequency distribution, which lists the frequency with which each class of data occurs. Other types of distributions include the relative frequency distribution and the cumulative frequency distribution.

Once data are organized into a table, graphs are created. For data that are qualitative, we can create bar charts and pie charts. For data that are quantitative, we can create histograms, stem-and-leaf plots, frequency polygons, and ogives. In creating graphs, care must be taken not to draw a graph that misleads or deceives the reader. If a graph’s vertical axis does not begin at zero, an axis-break symbol should be used to indicate the gap that exists in the graph.

(66)

Summarizing Data

Central Tendency (or Groups’ “Middle Values”)

Mean

Median

Mode

Variation (or Summary of Differences Within Groups)

Range

Interquartile Range

Variance

Standard Deviation

(67)


Mean

Median

Mode

(68)

The mean is another name for the average.

If describing a population, it is denoted µ, the Greek letter “mu”. (PARAMETER)

If describing a sample, it is denoted x̄, called “x-bar”. (STATISTIC)

Appropriate for describing measurement data.

Seriously affected by unusual values called “outliers”.
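A quick way to see the effect of an outlier is to change one value in a small data set. The numbers below are hypothetical and chosen only for illustration.

```python
import statistics

# Hypothetical measurements, identical except for the last value.
typical = [10, 12, 14, 16, 18]
with_outlier = [10, 12, 14, 16, 100]

print(statistics.mean(typical), statistics.median(typical))
print(statistics.mean(with_outlier), statistics.median(with_outlier))
```

The single unusual value pulls the mean from 14 up to 30.4, while the median stays at 14.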

(69)
(70)

The median is the middle value when a variable’s values are ranked in order; the point that divides a distribution into two equal halves.

When data are listed in order, the median is the point at which 50% of the cases are above and 50% below it.

The 50th percentile.

(71)
(72)

Class A: IQs of 13 Students

89 93 97 98 102 106 109 110 115 119 128 131 140

Median = 109 (six cases above, six below)

(73)

If the first student were to drop out of Class A, there would be a new median:

93 97 98 102 106 109 110 115 119 128 131 140

Median = (109 + 110) / 2 = 219 / 2 = 109.5 (six cases above, six below)
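Both Class A medians can be checked with Python's standard library. For an even number of values, `statistics.median` averages the two middle values, exactly as in the calculation above.

```python
import statistics

class_a = [89, 93, 97, 98, 102, 106, 109, 110, 115, 119, 128, 131, 140]

# 13 students: the median is the 7th value.
print(statistics.median(class_a))

# The first student drops out, leaving 12: the median is the mean of the 6th and 7th values.
print(statistics.median(class_a[1:]))
```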

(74)

The mode is the value that occurs most frequently.

One data set can have many modes.

Appropriate for all types of data, but most useful for categorical data or discrete data with only a few possible values.
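Python's `statistics.multimode` returns every value tied for the highest frequency, which makes the "many modes" case easy to see. The first list is the data from the stem-and-leaf example; the list of colours is hypothetical.

```python
import statistics

values = [37, 33, 33, 32, 29, 28, 28, 23, 22, 22, 22, 21, 21, 21, 20, 20,
          19, 19, 18, 18, 18, 18, 16, 15, 14, 14, 14, 12, 12, 9, 6]
colours = ["red", "blue", "red", "green", "blue"]  # hypothetical categorical data

print(statistics.multimode(values))   # a single mode
print(statistics.multimode(colours))  # two modes
```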

(75)
(76)
(77)

1. The mode may give you the most likely experience rather than the “typical” or “central” experience.

2. In symmetric distributions with a single peak, the mean, median, and mode are the same.

3. In skewed data, the mean and median lie further toward the skew than the mode.

[Figure: positions of the mode, median, and mean in a symmetric distribution and in a skewed distribution]

(78)

Range

Interquartile range (IQR)

Variance and standard deviation

Coefficient of variation (CV)

(79)

The difference between largest and smallest data point.

Highly affected by outliers.

Best for symmetric data with no outliers.

(80)
(81)

The interquartile range is the difference between the “third quartile” (75th percentile) and the “first quartile” (25th percentile), so it is the spread of the “middle half” of the values.

IQR = Q3 - Q1

Works well for skewed data.
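As a sketch, the range and the IQR can be computed for the Class A IQ scores used in the median example. Quartile conventions differ between textbooks and software; the "inclusive" method used here is one reasonable choice, not the only one.

```python
import statistics

class_a = [89, 93, 97, 98, 102, 106, 109, 110, 115, 119, 128, 131, 140]

# Range: largest minus smallest data point.
data_range = max(class_a) - min(class_a)

# IQR: third quartile minus first quartile (assumed "inclusive" quartile method).
q1, _, q3 = statistics.quantiles(class_a, n=4, method="inclusive")
iqr = q3 - q1

print(f"range = {data_range}, Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```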

(82)

If measuring the variance of a population, it is denoted by σ² (“sigma-squared”).

If measuring the variance of a sample, it is denoted by s² (“s-squared”).

Measures the average squared deviation of data points from their mean.

Best for symmetric data.

A drawback is that the units are squared.

(83)

This is nearly (if not for the n-1 in the denominator) the average squared deviation from the sample mean for our observed data.
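The formula this remark refers to does not appear in the text. The standard definitions, written with x̄ for the sample mean, µ for the population mean, n for the sample size, and N for the population size, are:

```latex
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
\qquad\text{and}\qquad
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2
```

Dividing by n - 1 rather than n is what makes the sample variance only "nearly" an average of the squared deviations.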

(84)

Sample standard deviation is square root of sample variance, and so is denoted by s.

Units are the original units.

Measures “average” deviation of data points from their mean.

(85)
(86)

• The coefficient of variation is the ratio of the sample standard deviation to the sample mean, multiplied by 100.

• Measures relative variability, that is, variability relative to the magnitude of the data.

• Unitless, so good for comparing variation between two groups and for comparing variability of measurements in completely different scales and/or units.
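A minimal sketch of the calculation, using the Class A IQ scores from the median example. The helper name `coefficient_of_variation` is illustrative; `statistics.stdev` is the sample standard deviation (n - 1 in the denominator).

```python
import statistics

class_a = [89, 93, 97, 98, 102, 106, 109, 110, 115, 119, 128, 131, 140]


def coefficient_of_variation(sample):
    """Sample standard deviation as a percentage of the sample mean."""
    return statistics.stdev(sample) / statistics.mean(sample) * 100


cv = coefficient_of_variation(class_a)
print(f"CV = {cv:.2f}%")

# Rescaling every value (a change of units) leaves the CV unchanged.
rescaled = [x * 10 for x in class_a]
print(f"CV after rescaling = {coefficient_of_variation(rescaled):.2f}%")
```

The second calculation illustrates the "unitless" point: multiplying all values by a constant changes the mean and the standard deviation by the same factor, so their ratio is unaffected.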

(87)

Measures departure from symmetry and is usually characterized as being left or right skewed as seen previously.

(88)

Pearson’s Skewness Coefficient

If skewness < -0.20: severe left skewness

If skewness > +0.20: severe right skewness

Fisher’s Measure of Skewness has a complicated formula, but most software packages compute it.

Fisher’s skewness > 1.00: moderate right skewness

Fisher’s skewness > 2.00: severe right skewness

Fisher’s skewness < -1.00: moderate left skewness

Fisher’s skewness < -2.00: severe left skewness
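The slides give thresholds for Pearson's coefficient but not its formula. The sketch below assumes the form (mean - median) / standard deviation, which is the version the ±0.20 rule of thumb is usually attached to; treat that pairing as an assumption. The data are the Class A IQ scores from the median example.

```python
import statistics

class_a = [89, 93, 97, 98, 102, 106, 109, 110, 115, 119, 128, 131, 140]


def pearson_skewness(sample):
    """Assumed form of Pearson's coefficient: (mean - median) / standard deviation."""
    mean = statistics.mean(sample)
    median = statistics.median(sample)
    return (mean - median) / statistics.stdev(sample)


skew = pearson_skewness(class_a)
print(f"skewness = {skew:.3f}")
if skew > 0.20:
    print("severe right skewness")
elif skew < -0.20:
    print("severe left skewness")
else:
    print("no severe skewness by this rule")
```

For Class A the mean lies slightly above the median, so the coefficient is small and positive, well inside the ±0.20 band.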

(89)

Kurtosis measures “peakedness” of a distribution and comes in two forms, platykurtosis and leptokurtosis.

(90)

Measures peakedness of a distribution.

A normal distribution has kurtosis = 0 (when kurtosis is reported as excess kurtosis, as most software does).

Leptokurtic distributions are more peaked than normal, with fatter tails: kurtosis > 0.

Platykurtic distributions are less peaked than normal (a squashed normal): kurtosis < 0.
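The sign convention can be illustrated with a small calculation. This sketch uses the simple moment-based excess kurtosis (fourth central moment divided by the squared variance, minus 3); statistical packages often apply a bias correction, so their figures may differ slightly. Both data sets are hypothetical.

```python
def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2**2 - 3 (no bias correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / m2 ** 2 - 3


flat = [1, 2, 3, 4, 5]                        # evenly spread, no peak
peaked = [-10, 0, 0, 0, 0, 0, 0, 0, 0, 10]    # sharp peak with two extreme values

print(f"flat:   {excess_kurtosis(flat):.2f}")    # negative: platykurtic
print(f"peaked: {excess_kurtosis(peaked):.2f}")  # positive: leptokurtic
```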

(91)

Correlation determines whether, and to what degree, a relationship exists between two or more quantifiable variables.

The degree of the relationship is expressed as a coefficient of correlation.

The presence of a correlation does not indicate a cause-effect relationship, primarily because of the possibility of multiple confounding factors.

(92)
(93)

The value of r is such that -1 ≤ r ≤ +1. The + and - signs are used for positive linear correlations and negative linear correlations, respectively.

Positive correlation: If x and y have a strong positive linear correlation, r is close to +1. An r value of exactly +1 indicates a perfect positive fit. Positive values indicate a relationship between x and y such that as values for x increase, values for y also increase.

Negative correlation: If x and y have a strong negative linear correlation, r is close to -1. An r value of exactly -1 indicates a perfect negative fit. Negative values indicate a relationship between x and y such that as values for x increase, values for y decrease.

No correlation: If there is no linear correlation or a weak linear correlation, r is close to 0. A value near zero means that there is no linear relationship between the two variables; they may be unrelated, or related in a nonlinear way.

Note that r is a dimensionless quantity; that is, it does not depend on the units employed.

A perfect correlation of ±1 occurs only when the data points all lie exactly on a straight line. If r = +1, the slope of this line is positive. If r = -1, the slope of this line is negative.
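As a sketch, r can be computed from sums of the data using the usual computational formula. The x and y values below are hypothetical, and `pearson_r` is an illustrative name.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient from raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


x = [1, 2, 3, 4, 5]           # hypothetical data
y = [2, 4, 5, 4, 5]
on_a_line = [2 * v + 1 for v in x]   # points lying exactly on y = 2x + 1

r = pearson_r(x, y)
r_line = pearson_r(x, on_a_line)
print(f"scattered points: r = {r:.3f}")
print(f"points on a line: r = {r_line:.3f}")
```

The scattered points give a moderately strong positive r of about 0.775, while points that lie exactly on a rising straight line give r = 1.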

(94)

A correlation greater than 0.8 is generally described as strong, whereas a correlation less than 0.5 is generally described as weak. These values can vary based upon the "type" of data being examined. A study utilizing scientific data may require a stronger correlation than a study using social science data.

(95)

The coefficient of determination, r², is useful because it gives the proportion of the variance (fluctuation) of one variable that is predictable from the other variable. It is a measure that allows us to determine how certain one can be in making predictions from a certain model/graph.

The coefficient of determination is the ratio of the explained variation to the total variation. It is such that 0 ≤ r² ≤ 1, and denotes the strength of the linear association between x and y.

The coefficient of determination indicates how closely the data cluster around the line of best fit. For example, if r = 0.922, then r² = 0.850, which means that 85% of the total variation in y can be explained by the linear relationship between x and y (as described by the regression equation). The other 15% of the total variation in y remains unexplained.

(96)

A regression is a statistical analysis assessing the association between two variables. It is used to find the relationship between two variables X and Y.

(97)

Regression Equation: y = a + bx

Slope (b) = (NΣXY - (ΣX)(ΣY)) / (NΣX² - (ΣX)²)

Intercept (a) = (ΣY - b(ΣX)) / N

where

      x and y are the variables

      b = the slope of the regression line

      a = the intercept point of the regression line and the y axis

      N = number of values or elements

      X = first score

      Y = second score

      ΣXY = sum of the products of first and second scores

      ΣX = sum of first scores

      ΣY = sum of second scores

      ΣX² = sum of squared first scores
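The slope and intercept formulas translate directly into code. The scores below are hypothetical, and `linear_regression` is an illustrative name.

```python
def linear_regression(xs, ys):
    """Return (a, b) for the line y = a + b*x, using the sum formulas above."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


first_scores = [1, 2, 3, 4, 5]    # hypothetical X values
second_scores = [2, 4, 5, 4, 5]   # hypothetical Y values

a, b = linear_regression(first_scores, second_scores)
print(f"y = {a:.1f} + {b:.1f}x")
```

For these scores N = 5, ΣX = 15, ΣY = 20, ΣXY = 66, and ΣX² = 55, so b = (330 - 300) / (275 - 225) = 0.6 and a = (20 - 0.6 × 15) / 5 = 2.2.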

(98)

Thank You
