< Previous |TOC | Next >

Robert Allison PhD Dissertation - Chapter 6 - RESULTS


The purpose of the results section is two-fold -- to demonstrate all the basic capabilities of TABIS, and to present several examples of unique TABIS output.

The examples in this Section are not a comprehensive list of all the analyses which could be performed with TABIS, or even all the combinations of analyses which are built into the menu system. TABIS is so flexible that an "endless" combination of choices could be made from the menus alone, not to mention all of the customizations which users could make by modifying the code or adding customized extensions.

6.1 Integrating Data from Disparate Sources to Improve the Quality and Forecastability

One of the most important characteristics of a data warehouse is that the data from disparate sources are stored in such a way that they can be integrated. This section provides several examples of different ways data can be integrated in TABIS.

In the first three examples (Sections 6.1.1, 6.1.2, and 6.1.3), the data from disparate data sets are integrated mathematically -- monetary values are converted to constant dollar, sales are converted to per capita, and population is converted to per square mile values. In the last example (Section 6.1.4), the data from two sources are integrated graphically.

6.1.1 Using Deflators to Convert Current Dollar Values to Constant Dollar Values

Although monetary values are commonly deflated by calculating log ut, there is no guarantee that the logarithmic transformation will produce good constant dollar values. [Bat, p.155] [ETS, p.27] Instead, deflators which are specially calculated for individual items should be used.

The Department of Labor, Bureau of Labor Statistics (BLS) Consumer Price Index (CPI) publishes several deflators which can be used to convert current dollar values into constant dollar values. [Kumar, p.38] Average (1982, 1983, 1984) indexes are available for about 35 consumer items and groups of items. The following list shows some of the apparel items for which a BLS CPI deflator is available, along with the id number used by BLS CPI (Figure 6.1).

The code in Figure 6.2 demonstrates how to mathematically integrate BLS CPI deflators with the Sales and Marketing Management (SMM) magazine's data on Women's and Girls' apparel sales. First, the deflator for women's and girls' apparel is selected from the BLS CPI data into table A. Next, the women's and girls' apparel sales are selected from the SMM data into table B. Then data sets A and B are combined, and the constant dollar values are calculated in table C.

The TABIS menus generated the code for tables A and B, and a user-written extension was used to join tables A and B by year, producing table C (Table 6.1).

The purpose of this example is to show that data can be easily integrated between data sets in TABIS. This technique for converting current dollar values into constant dollar values is a good starting point for several other analyses, such as plotting the data, or performing time series analyses.

6.1.2 Using Population Data to Calculate Per Capita Expenditures

Using a similar technique as in the previous example, the Population estimates from the U.S. Census can be integrated with sales data to produce per capita sales.

First TABIS's "ASCII Table" option was used to generate the code to produce table A (see Figure 6.3), containing women's and girls' apparel expenditures (i.e., l=134) in current dollars (i.e., t=201) from the National Income and Product Accounts (NIPA) Personal Consumption Expenditures (PCE) data set. TABIS uses the meta data to determine what values are available in the data, and presents this to the user in plain English, with simple choices via the ASCII menu -- TABIS then generates the SAS code with the l=134 and t=201 for the user.

Similarly, TABIS was used to generate the code to select the female population into table B (see Figure 6.3), saving the user much time and chance for error by generating the code to sum all the individual ages. These two queries would be difficult to write from scratch without the benefit of the meta data (i.e., data about the data) that is programmed into TABIS -- without the TABIS menus, the users would need to know the data set names, the variable names, the monetary units used, and the "codes" used to represent age, race, sex, etc.

Table C is then generated with a user-written extension that joins tables A and B by year, and divides the sales by population to calculate per capita sales (see Figure 6.3).

The output (Table 6.2) shows that although the total sales increased each year, the per capita sales actually decreased in 1989.

Further analysis would be possible by converting the per capita values into constant dollar amounts, using the method described in Section 6.1.1. With each additional analysis, a more complete picture of the data could be obtained. Since the purpose of this example is to show that the technique is possible in TABIS, not to provide a complete analysis of women's and girls' expenditures, additional improvements are not made to the data at this time.

6.1.3 Comparing State Populations Using Total and Per Square Mile Values

In addition to constant dollar and per capita conversions, data can also be integrated to produce per square mile values. This example demonstrates how to integrate the Census Population estimates for "retired" people (people over 65) with state land area data to calculate retired people per square mile (ie, retired population density.)

The first step is to have TABIS generate code to sum all the population aged 65 and over, by state, for the year 1993. The code in (Figure 6.4) queries one of the Census data sets, and saves the results in table A (Table 6.3).

As the data set is sorted in descending order, the first 10 observations in the list are the 10 states with the largest retired population (see Table 6.3).

Next, table B is generated containing the square miles in each state, and then table C is created by joining tables A and B by state, and dividing the population (age65_) by the square miles (sqmi) to produce population, aged 65 and older, per square mile (per_sqmi). The data set is again sorted in descending order so the first ten observations represent the ten states with the largest retired population per square mile (see Figure 6.5).

The results are shown in Table 6.4. Notice that this table provides a much different picture than the "raw" totals (Table 6.3). Both views of the data are valuable, and together they provide a more complete picture of the data. TABIS provides the framework that allows such alternate views of the data to be easily produced.

6.1.4 Integrating Data by Overlaying Plots

In addition to mathematically integrating the disparate data sets, TABIS can also be used to graphically integrate the data. The following is a simple example showing how female clothing sales data from two data sets can be graphically integrated by plotting them together.

First, the desired data are selected from the National Income and Product Account (NIPA) Personal Consumption Expenditures (PCE), and the Sales and Marketing Management (SMM) data sets, and stored in tables A and B respectively. Notice that the monetary values are converted to billion dollars, and mnemonic variable names are assigned. Tables A and B are then combined into table C. (see Figure 6.6)

The SAS System allows several variables from a data set to be plotted together, with different plot symbols (and/or colors) for each variable. In this case, symbol1 is used to print a dot at the vertices along the SMM line, while symbol2 is used to print diamonds at the vertices of the NIPA line. The "overlay" statement is used to force both lines to be printed on the same page. (see Figure 6.7)

Figure 6.8 shows that the female clothing sales reported in these two data sources is generally very close, except in the last year. Comparing the same data (or very similar data) from several different sources is a good practice -- if large differences are found, they should be investigated.

Figure 6.8 Sample Plot Integrating Two Data Sets

6.2 Summarizing Large Data Sets with High-Resolution Graphics

One of the advantages of analyzing data with modern high-capacity, high-resolution computer equipment is that huge data sets can now be summarized graphically. In the past, large data sets could only be analyzed mathematically using methods such as summary tables and summary statistics, and sometimes crude graphics such as low-resolution plots. Unlike the computers and printers commonly used 20 or 30 years ago, almost all of today's desktop computers can produce high-resolution graphics on both their screens and printers.

The high-resolution graphics allow the data to be visualized using endless combinations of plots, maps, symbols, colors, patterns, shapes and sizes.

This section describes several methods developed to visualize the large data sets in TABIS.

6.2.1 Projected Changes in the Age Distribution of the U.S. Population

The population projections from the Census are among the largest datasets in the TABIS data warehouse. The Census provides population projections by individual age, race, sex, state and year. This fine granularity allows maximum flexibility for how the data can be used, but a considerable amount of work must be done to select, combine, summarize, and present the data in a form that answers questions like "how is the age distribution of the population for 2015 projected to differ from that of 1995?"

Population tree plots (aka population pyramids) which group the population into age groups consisting of 5 years each and display them as horizontal bar charts, were developed to visualize such shifts in the age distribution. Visualizing the data with population tree plots, one can easily plot any age, race, sex, state and year on the left side of the tree, and compare it to any other combination on the right side of the tree. Most often the tree has only one variable different on the left and right sides, such as sex, to make the two sides easy to compare.

Since there were no built-in functions in SAS to produce tree plots, considerable research was performed to develop the needed computer code. The TABIS code developed to produce such plots is described in detail in the 1994 SouthEast SAS Users Group (SESUG) conference proceedings, pp. 91-93. [SESUG]

The TABIS menu program allows users to easily select the values they want on each side of the tree plot, and automatically generates the SAS code needed to produce the specialized plot.

The following plot takes the tree plot a step further, and overlays the outline of one tree plot onto the main (colored/filled) tree plot. (Figure 6.9) shows both the 1995 population distribution (solid/filled) and the projected 2015 population distribution (outline) together for easy comparison. The plot summarizes 400 data points, and illustrates that the main 10-year "baby boomer" bulge (aged 30-39 in the 1995 plot) will shift to the 50-59 age group in the year 2015, greatly increasing the population in that age group. It also shows that the U.S. population is becoming older, with the tree becoming shaped more like a chimney instead of a pyramid. This shift in the age distribution will quite likely cause changes in apparel demand in the U.S. in areas such as style, quality, quantity, and price.

Figure 6.9 Sample Population Tree Plot

6.2.2 Projected Changes in the Geographical Distribution of the U.S. Population

In addition to visualizing the shifting age of the U.S. population, as in the previous example, TABIS can also be used to visualize the geographical population shifts. NPA Data Services provides data such as population projections for particular age groups at the U.S. county level. This data is difficult to grasp in tabular form, or as bar graphs on a plot, since there are over 3,000 U.S. counties. Instead, a high-resolution map of the U.S. by county was used as a framework to help visualize this data graphically by shading the counties based on the population values.

Figure 6.10 shows the shift in population aged 20-64 between the years 1995 and 2015. This age group is important because it shows the population which provides the potential labor force, as well as a strong consumer group. The map shows that many counties in Florida, California, and Texas are projected to experience over 20% increases, as are many counties in several other states.

Figure 6.10 U.S. Map of Projected Population Shift for Age 20-64

The next map (Figure 6.11) shows the projected shift in the "retired" population (aged 65 and over) between 1995 and 2015. This age group is projected to grow tremendously as the baby boomers age. Such a large increase could drastically change the demand for products such as apparel, food, automobiles, and housing. Companies with the foresight to use this type of information could capitalize on an increasing market for the goods and serviced demanded by this age group.

Notice that there are some major differences between the age 20-64 population shifts and the age 65+ population shifts. For example, almost all counties in North Carolina are forecast to have an increase of over 20% in population aged 65+, whereas only certain counties are projected to experience an increase of over 20% for the 20-64 age group.

Also, some areas are actually projected to have a decrease in their retired population, particularly in the central U.S.

Figure 6.11 U.S. Map of Projected Population Shift for Age 65+

6.2.3 Geographical Distribution of the U.S. Textile and Apparel Manufacturing Employment by County

The Bureau of Labor Statistics (BLS) Employment and Earnings (EE) is another very useful data set contained in the TABIS data warehouse. The examples in this section demonstrate how TABIS can be utilized to geographically visualize the distribution of textile and apparel employment in the U.S.

Unlike the previous example which used built-in SAS "choropleth maps" to represent the percent change in population as three different colors/shades, the examples in this section represent values as dots, where the areas of the dots are proportional to the data values. These "dot maps" have several advantages over the choropleth maps: 1) they represent each value on a continuous size scale, instead of grouping the values into discrete color ranges, 2) the dots are independent of the size of the county, and therefore show less area size bias, and 3) the dot maps are more efficient to compute and print.

Since there were no built-in functions in SAS to produce "dot maps," techniques were developed as part of the TABIS research. For a detailed description of how the code works, see "Dot Maps Made Simple" in the 1995 SAS Users Group International (SUGI) conference proceedings, pp. 839-844. [SUGI95]

The first two examples presented here show the "raw" textile and apparel manufacturing employment values for each county represented as dots. These maps were generated directly from the TABIS menu, with no user-written extensions or modification required.

After the desired data are selected, the radii of the dots are calculated, and the center location of each county is estimated. The dots are then printed on a U.S. county map using a SAS annotate data set, and the state outlines are also overlaid on the map as an annotate data set (see code in Figure 6.12).

The first example (Figure 6.13) shows the textile manufacturing employment (SIC 2200) to be heavily concentrated in North and South Carolina, with a few smaller areas of high concentration which are mostly located in the eastern U.S.

Figure 6.13 Textile Employment Dot Map

The second example (Figure 6.14) shows apparel manufacturing employment (SIC 2300). Notice that there are many dots in the southeast U.S., but that these dots are all scaled very small because of the very large dots in Los Angeles county, which had 100,010 apparel manufacturing employees, followed by New York county New York county with 53,174. Although the large values of those two counties are important, they are somewhat biased by the large populations of those two counties.

Figure 6.14 Apparel Employment Dot Map

It would be possible to merely discard the very large values as "outliers" so that the other dots could be re-scaled, and studied more closely, but a different technique was used that allows all of the values to be compared, without discarding any data.

The technique involved combining the employment data with the county population data, and calculating what percentage of the "working" population (aged 20-64) the apparel manufacturing employment represents. This dramatically shows the importance of apparel manufacturing in the southeast U.S. (see Figure 6.15)

Figure 6.15 Dot Map Showing Importance of Apparel Employment

6.2.4 Apparel Consumption Profiles by State

The Sales and Marketing Management Magazine (SMM) published annual estimates of apparel expenditures in all stores and in department stores, by state, sex and year. Since it is difficult to comprehend and compare several variables between 50 states, techniques were developed to help visualize the data graphically, allowing the user to compare large amounts of data at a glance.

The dot map technique demonstrated in the previous example was extended to allow visualizing expenditures of both males and females on a single dot by dividing it into two slices (one slice for male and one for female), ala a "pie chart" for each of the 50 states. Although SAS contains built-in code for producing pie charts, producing pie charts that can be overlaid with U.S. maps by state requires the use of quite sophisticated coding techniques. The techniques are described, in detail, in Example 3 of the SUGI paper "Dot Maps Made Simple" [SUGI95].

The sample code in Figure 6.16 demonstrates how the annotate data set containing the pie charts is created. First, black pie slices representing the female expenditures are generated, then white pieces representing the male expenditures are generated, and then black circles are drawn around the pie charts to separate the white pie slices from the white background (I use colored slices in this web version of my dissertation).

The following figures show the male and female apparel expenditures in all stores (Figure 6.17) and department stores (Figure 6.18). These maps indicate that the proportion of female expenditures in all stores are basically the same in all the states, whereas the female expenditures in department stores differ from state to state. Also, the size of the dots show which states have large or small apparel markets.

Figure 6.17 Clothing Sales in All Stores Pie Chart Map

Figure 6.18 Clothing Sales in Department Stores Pie Chart Map

6.2.5 Apparel Consumption Profiles by Income Level Using Three Dimensional Plots

Three dimensional visualization is another valuable data analysis technique made possible by high-resolution graphics. Three dimensional (3D) plots, for example, allow two variables to be plotted against a third variable, producing a plot that allows the values to be easily analyzed in two dimensions. 3D plots provide an intuitive means to easily compare the values of the variables, as well as their interactions.

Similar comparisons can also be made using two dimensional (2D) plotting techniques, such as grouped bar charts, or overlaid line plots. With the 2D plotting techniques, it is easier to make more precise comparisons, but usually at the expense of the range of values that can be analyzed at the same time.

In this example, some of the Bureau of Labor Statistics (BLS) Consumer Expenditures (CE) data are visualized using 3D plots. First the average U.S. household expenditures on apparel and services per household are plotted by year and by household income level (Figure 6.21). Next, the total U.S. expenditures on apparel and services per household are plotted by year and by household income level (Figure 6.22).

Several transformations were applied to the data to transform it into a format suitable for 3D plotting. For example, some of the income ranges reported by BLS/CE were combined, using a weighted average technique, so each of the ranges would be a $10,000 increment (see Figure 6.19).

Also, the data set was transposed and sorted so that it would be in a format that could be plotted using SAS's "g3grid" and "g3d" procedures. Additionally, the axes and labels were fully customized to create a plot that is more easily readable, instead of using the defaults (see Figure 6.20).

The first plot (Figure 6.21) confirms what intuition would dictate - the average household expenditure on apparel and services increases as the income level increases, and the distribution between the income levels has held fairly steady from year to year. The second plot (Figure 6.22) adds more insight by plotting the total expenditure on apparel and services by household income level, and shows that the households with an income of over $50,000 spend significantly more than the other groups.

The BLS/CE data set contains many other variables for which similar 3D plots could yield additional insight into the consumer groups and trends. Many such options are available through the TABIS menus.

Figure 6.21 Average Household Expenditure on Apparel and Services by Income Level and Year

Figure 6.22 Total U.S. Expenditure on Apparel and Services by Income Level and Year

6.3 Using Animations to Analyze the Dynamic Nature of Time Series Data

Many of the data sets in TABIS contain time series data, and methods were developed in TABIS to visually analyze this data. Often, simple plots of the data over time were adequate, but in cases where several variables needed to be analyzed together, a single, static plot was not adequate.

To accommodate the need to analyze the dynamic nature of the data, animations were developed. To produce these animations, plots of the desired variables were created for each desired year, and programs such as "XAnim" were used to display the plots one after another, producing an animation. The XAnim software allows the user to stop and start the animations at will, and to reverse the order so that the animation goes either forward or backward in time. (The example in this web version of my dissertation uses a web browser to play the animation, rather than Xanim.)

These animations allow the users to easily detect where the data changes, and stays the same, over time.

6.3.1 Animating Monthly Apparel Consumption to Detect Trends and Outliers

The Monthly Retail Sales and Inventories (MRSI) data set is an example of a data source with several variables (i.e., twelve monthly values) per year. Plots of several years of MRSI data were generated and animated, to show if there were any changes in the consumer patterns over the years.

Figure 6.23 shows one such animation. From this example, it is evident that the "seasonal" apparel consumption trends held fairly constant over time. December always has, by far, the largest percentage of the apparel sales, and is always followed by low apparel sales in January and February. There is also a slight peak in August, which probably represents back-to-school apparel expenditures.

Although the large seasonal trends are clearly visible without any analysis tools, the animation allows smaller trends and differences to be detected. For example, the animation indicates that the percentage of expenditures made in December have gradually declined, from 14.5% in 1975 to 13.3% in 1990. Also, the values for August seem to have increased slightly, from 8.5% in 1975 to 9.0% in 1990.

The bars for the other months show slight fluctuations, both up and down, and the animation of these changes resembles pistons in an engine -- no definite trends, but nonetheless an interesting animation which could prove useful in detecting the small changes.

Figure 6.23 Monthly Sales Animation

['refresh' browser to restart animation]

6.3.2 Animating the U.S. Population Projections to Visualize Shifts in Age Distribution

The population projections from the U.S. Census are some of the most often used data in TABIS. Many of the analyses of other data sets in TABIS depend on the population projections, and the projected shift in age distribution as the "Baby Boomers" grow older will certainly affect apparel demand.

Therefore, animations were developed to show how the "Baby Boomer Bulge" will grow older. Several animations were tried, but the most useful one used a variation of the "population tree plots" described in Section 6.2.1, and the "XAnim" animation software (see Figure 6.24). The animation shows a sequence of plots spanning the years 1990-2020, and uses an outline/shadow of the previous year for easy comparison of data between the years.

In addition to showing the increasing age of the baby boomer bulge, the animation also shows that the number of births is projected to level off, or even decrease, which will dramatically change the age distribution of the U.S. population.

Figure 6.24 Population Animation

['refresh' browser to restart animation]

6.4 Programmatically Trying Forecasting Models

One of the major advantages of TABIS over other data analysis systems is that the data can be accessed programmatically, using the rich programming language which is built into the SAS System. This allows users to easily use built-in forecasting procedures or write customized forecasting code, and apply it to multiple groups of data using programming loops and "by statements" in the SAS code. The code can easily be re-used to try different forecasting techniques on the same, or similar, data.

Several sample "forecasts" are presented in this section to demonstrate the techniques that can be used to programmatically apply forecasting models to the TABIS data. The emphasis here is on demonstrating the programmatical techniques -- more sophisticated analyses can be easily performed using the code demonstrated in these examples as a starting point.

The first two examples (Sections 6.4.1 and 6.4.2), were attempted as part of an NTC project. There are some fundamental problems with these forecasts, for example they attempt to forecast nearly twenty years of future values with only five to seven years of past data. Despite their short-comings, the examples provide a good framework for users to modify and produce their own programmatical forecasts, using more data and more sophisticated forecasting techniques.

The third example (Section 6.4.3), demonstrates how data from several disparate data sets can be easily selected using the TABIS interface, and then combined and analyzed using sophisticated data analysis tools such as SAS/INSIGHT

6.4.1 Forecasts from NPD Apparel Consumption Data Integrated with Census Population Projections

This first example shows how NPD apparel consumption values can be integrated with Census population projections to forecast apparel consumption. The code was generated by the TABIS menus, with a few changes to simplify the example.

First, a query is coded to select the desired data. The apparel consumption, in units, is selected from the NPD data set for the desired product, for all age groups reported by NPD. Then the Census population projections for the corresponding age groups and sex are selected (see Figure 6.25). The NPD and Census values are both converted to like units during the process, so they can be integrated.

The NPD and Census data sets are then integrated, and per capita apparel consumption values are calculated (Figure 6.26). The units per person values are more easily forecast than total units, total expenditure, per capita expenditure, or per capita expenditure in constant dollars.

Figure 6.27 shows the calculated "jeans per person" for each age group. Notice that the age 20-24 group shows a slight decrease in the number of jeans per person, but the other groups have a fairly steady number over the years.

Figure 6.27 Calculated Jeans per Person Plot

Next, a forecasting method is applied (see Figure 6.28). In this case, a very simple forecasting method is used -- the mean units per person (for_per) is calculated for each age group, over all the years for which consumption data are available. The for_per values are then used as the forecast units per person for all the future years to the year 2010. This method assumes that there is little or no change in supply and demand for the product -- an assumption which may or may not be true.

The calculated units per person and the mean are plotted together using the code in Figure 6.29. This allows for a visual "sanity check" of the forecasts. The code plots the number of units per person as black dots, and the forecasts as diamonds. The code produces a separate plot for each consumer age group ("by con_grp"), and all the plots are produced to the same scale (using the "uniform" option) so the plots can be easily compared.

Figure 6.30 shows a sample plot for age 35-44. Note that the data might show part of a cycle, but the forecasting technique produces a linear forecast. Also, none of the other age groups hinted as strongly of a cycle, so this might be coincidence.

Figure 6.30 Jeans Per Person Projection

After the units per person are forecast for each age group, these forecasts are multiplied by the Census' population projections for each age group, producing a forecast of the total units. The total jeans forecast and the population projection are then plotted together for each age group using the code in Figure 6.31. This plotting code produces a scale for the jeans forecast along the left axis, and the population projection on the right axis.

Figure 6.32. shows the resulting plot for age 35-44. Notice that although the data hints of a cycle, the linear forecasting technique used can not produce a cyclical forecast.

Figure 6.32 Total Jeans Projection (NPD)

The forecasting code shown in Figure 6.31. can be easily changed, and the code which performs the queries and produces the graphical output can be re-used, thereby allowing other forecasting techniques to be easily tried.

Although this simple forecasting technique seems to fit the jeans data fairly well, products for which there is a change in demand are more difficult to forecast. Figures 6.33 and 6.34 show two forecasts for men's tailored clothing -- one using the "mean" technique just described, and the other using a simple regression technique. Notice that the two forecasts differ widely. This shows one of the dangers of trying a forecast too far into the future.

Figure 6.33 Tailored Clothing Projection Using Mean

Figure 6.34 Tailored Clothing Projection Using Linear Regression

6.4.2 Forecasts from MRCA Apparel Consumption Data Integrated with Census Population Projections

This next example (Figure 6.35) shows how to apply the same (mean) forecasting technique to the Marketing Research Corporation of America (MRCA) panel data for jeans consumption, so you can compare it to the previous example (Figure 6.32) which forecast the NPD jeans consumption. It is always a good practice to compare data and forecasts from different sources and determine whether they produce the same, or at least comparable, results.

The code from the previous example was easily re-used, with a few changes, to apply the same programmatical forecasting technique to the MRCA data, and produce plots which can be easily compared. One difference is that there were only five years of MRCA panel data, compared with seven years of NPD data. Also, the MRCA data was stored in a different format, therefore the data had to be transposed so they are structured the same.

The consumption values and the forecasts are roughly the same for both the MRCA and the NPD jeans data, except for the 1992 consumption value. That value seems slightly high in the MRCA data, and might be an outlier. It would be advisable to confirm the 1992 value with both NPD and MRCA before basing decisions on the data.

Figure 6.35 Total Jeans Projection (MRCA)

6.4.3 Interactive Data Analysis and Exploration Utilizing TABIS and SAS/Insight in a Combined Approach

This example shows how TABIS can be used in conjunction with new, highly interactive, data analysis tools which are becoming available. The tool, the approach, and the results are described. Hopefully this example can serve as a starting point for more in-depth analyses, and future research, with interactive data analysis tools.

The tool chosen for this example is SAS/Insight. It allows users to explore and analyze their data using a graphical user interface (GUI). [INSIGHT] This tool is highly interactive, and allows users to easily perform statistical analyses and visualizations on their data by selecting variables of interest with the mouse, and selecting the desired analyses from pull-down menus or interactive selection screens. Several analyses can be performed at once, and data points of interest can be easily selected in one plot, and those points automatically become selected in the other plots and tables for easy identification and investigation.

First, the TABIS interface was run to generate sample code to query data of interest (data pertaining to the 35-44 age group, which might be related to jean sales) from several tables in the TABIS data warehouse. The 'tabis.sas' programs from five runs of the TABIS interface were combined into a single SAS program (Figure 6.36) which produces five SAS data sets from the following disparate sources: Census (population), NPD (jean sales), NPD (slacks sales) Bureau of Labor Statistics Consumer Expenditures (income), and Bureau of Labor Statistics (consumer price index for "Apparel Commodities, less Footwear").

Figure 6.36 Selecting Related Data from Several Sources

All of the code shown in Figure 6.36 was automatically generated by the TABIS interface, saving the user much time, effort, and chance of making errors. Similar code could be easily generated for other analyses.

The five data sets were then combined into a single one using the 'merge' function of the SAS data step, as shown in Figure 6.37. Combining variables from disparate sources would have been an arduous task, had they not been stored in an integrated fashion in the TABIS data warehouse -- this is one of the main benefits of TABIS over most other data bases and analysis systems.

Figure6.37 Merging Data Sets for Combined Analyses

With all the variables merged into a single data set, they can be easily combined and transformed. For example, several of the variables were combined to produce useful variations such as: per capita sales of jeans (JPP), per capita sales of slacks (SPP), combined sales of jeans and slacks (JS), and income in 1983 dollars (I83). The SAS code used to perform these transformations is shown in Figure 6.38. Figure 6.38.

Figure 6.38 Applying Transformations to Variables

Short variable names are used in this example so they will more easily fit in the tables and other SAS/Insight output, such as the 10x10 scatter plot. Table 6.5 shows a list of all the variable names along with their textual definition, for easy reference.

The following command was then used to start SAS/Insight, and allow interactive analyses to be tried on the data:

proc insight data=combined;

SAS/Insight displays the data in a familiar spreadsheet-like layout (as shown in Table 6.6) which allows the users to easily view and interact with the data.

Table 6.6 Interactive Table of Merged Variables

The first analysis chosen for this example is a grid of Scatter Plots (see Figure 6.39). Scatter plots can reveal a wealth of information, including correlations, dependencies, clusters, and outliers. [INSIGHT] SAS/Insight allows every variable to be plotted against every other variable on a single page, to allow analysis of all the variables at a glance.

The following steps were used to produce the scatter plot shown in Figure 6.39: 1) the column headers of all the variables were selected in the spreadsheet-like table (using Ctrl+MB1), 2) the "Analyze -> Scatter Plot (X Y)" menu was run, and 3) in the resulting scatter plot, the "Edit -> Windows -> Tools" pull-down menu was run to allow the "zoom" function to be used, automatically making the scatter plots larger and adding min/max values to the axes where space allows.

The interactive capabilities allow users to click on outliers, or other points of interest in the scatter plots, and see the same point selected in all other plots as well as the spreadsheet-like table. For example, by clicking on the outlier in the slacks/jeans (SLK/JN) plot, a "3" shows up beside that data point, observation 3 is highlighted in red in the spreadsheet-like table and all other points from 1988 are highlighted in red in all the other plots. The year 1988 appears to be high for slacks sales (SLK) and low for jeans sales (JN).

Figure 6.39 Interactive Scatter Plot of Merged Variables

The scatter plots show several strong correlations, but most of these are intuitively obvious - namely the correlations between YR (year) and several of the other variables, such as POP (population). Notice that, although there is a high correlation between YR and INC (income), there is not a strong correlation between YR and I83 (income, adjusted for inflation). The plots indicate that there also might be a correlation between JN (jeans) and SLK (slacks), with the exception of the 1988 "outlier". Remember, though, correlation does not necessarily indicate cause -- the apparent positive correlation between JN and SLK could just indicate that both of these have risen at the same time because a third variable, POP, has risen. Also, the scales used for the axes of the plots are auto-scaled based on the minimum and maximum values of data in those plots, so the scatter plots could be misleading in some cases.

Although it is difficult to conclude that there is a definite relationship between jean sales (JN) and any of the other variables based on the scatter plot, a "Multiple Regression" is presented as the next example. In this model, jean sales (JN) was chosen as the response variable - the response variable measures the outcome to be explained or predicted. The following three variables were tried as possibly explanatory variables or predictors of jean sales in the model: slacks sales (SLK), population (POP), and income (INC). Intuitively, one would presume that as population increases, JN sales would also increase. Also, an increase in SLK sales might displace jean sales, and a change in INC could presumably affect JN sales also. One drawback to using SLK as an explanatory variable is that, by the time you know SLK sales, it is probably too late for your JN sales estimate to be of much use.

The following steps were used in SAS/Insight to perform the multiple regression analysis: 1) the "Analyze -> Fit(Y X)" menu was run (using the default statistical methods, without any customization), 2) in the "fit variables dialog window," JN was selected and added to the "Y" list, and the SLK, POP, and INC were selected and added to the "X" list, and then 3) the "OK/Run" button was clicked, producing the output shown in Figure 6.40.

Figure 6.40 Multiple Regression ( JN = SLK POP INC )

The "Summary of Fit" (in Figure 6.40) contains summary statistics for the model. The "Root MSE" value is 3.1649 (this is the square root of the mean square error given in the Analysis of Variance table). The Root MSE is an estimate of population standard deviation in the regression model. The "R-Square" value is 0.8183, which means that over 80% of the variation in JN sales is explained by the fitted model. The "Adjusted R-Square" value is 0.6367 -- it is adjusted for the number of parameters in the model, and is more comparable over models involving different numbers of parameters than R-Square. [INSIGHT]

In the "Analysis of Variance" section, the "Sum of Squares" represents variation present in the data (calculated by summing squared deviations) from the three sources: "Model," "Error," and "C Total" (C Total is the total sum of squares corrected for the mean, and is the sum of Model Error). The "DF" (Degrees of Freedom) are associated with each sum of squares. The "Mean Square" is the "Sum of Squares" divided by its associated DF. [INSIGHT]

Also in the "Analysis of Variance" section, the "F Stat" tests the null hypothesis that none of the explanatory variables has any effect (i.e., the regression coefficients B1, B2, and B3 are all zero) -- In this model, the F Stat is 4.5050. The "p-value" (labeled "Prob > F") is used to determine whether to reject the null hypothesis. The p-value is the probability of obtaining, by chance along, an F Stat greater than the computed F Stat when the null hypothesis is true. The smaller the p-value, the stronger the evidence against the null hypothesis. If the p-value is very small, you can reject the null hypothesis and conclude that at least one of the explanatory variables has an effect -- in this case, the p-value is 0.1240. [INSIGHT]

The "Parameter Estimates" show the "tolerance" and "variance inflation". These measure the strength of interrelationships among the explanatory variables in the model. Tolerances close to 0 and large variance inflation values indicate strong linear association or collinearity among the explanatory variables. [INSIGHT]

The plot of residuals versus the predicted values (in Figure 6.40) does not show a random scattering of observations, and there appears to be an outlier -- this may indicate a poorly specified model. [INSIGHT]

The "Parameter Estimates" also show the results of the regression analysis of JN on SLK, POP, and INC. The regression model for the ith observation can be written as:

where JNi is the number of jeans sold (in millions); B0 to B3 are the regression coefficients; SLKi, POPi, and INCi are the values of the explanatory variables (slacks, population, and income); and Ei is the random error term. The Ei's are assumed to be uncorrelated, with mean=0 and variance=sigma2. Filling in the appropriate values, the equation becomes:

A plot of the actual and estimated JN sales (Figure 6.41) shows a fairly good fit (albeit not a great fit), despite the deficiencies of the model. [INSIGHT]

Figure 6.41 Plot of ( JN = SLK POP INC ) Model Results

Perhaps more insightful techniques could be used to produce an even better model -- techniques such as exploring the data in three dimensions. SAS/Insight provides an excellent tool for this type of analysis with their "Rotating 3D Plot" capability.

Figure 6.42 shows a 3D plot of the three main variables from the previous model (JN, SLP, and POP). By interacting with the plot, and rotating it in several directions, a prominent outlier becomes evident. By clicking on this outlier, it was determined to be the 1988 values.

Figure 6.42 Rotating 3D Plot of Variables of Interest

['refresh' browser to restart animation]

Therefore, a new model will be tried without the 1988 outlier values. Note, however, that this action reduces the number of annual values from seven to six, neither of which is an overwhelmingly high number for a statistically valid analysis.

In addition to eliminating the 1988 outliers, a better model might also be made by using transformed values such as Jeans Per Person (JPP), Slacks Per Person (SPP), and Income in 1983 Dollars (I83), instead of the "total" and non-adjusted values such as Jeans Sales (JN), Slacks Sales (SLK), Population (POP), and Income (INC) (which all tend to increase over time because of increases in population, inflation, and the like). With the exception of the 1988 outlier, there appears to be definite clustering in the JPP/SPP scatter plot -- a higher JPP seems to indicate a lower SPP. Also, with the exception of the 1988 outlier, there appears to be some possible correlation or clustering in the JPP/I83 scatter plot Figure 6.39.

Figure 6.43 shows the results of the model (JPP = SPP I83).

Figure 6.43 Multiple Regression ( JPP = SPP I83 )

The "Summary of Fit" for the (JPP = SPP I83) model with the 1988 outliers thrown out (Figure 6.43), shows that the "Root MSE" is 0.0204, the "R-Square" value is 0.9494 (which means that almost 95% of the variation in JPP sales is explained by the fitted model), and the "Adjusted R-Square" is 0.9156. These statistics, as well as the others shown in Figure 6.43, indicate that this model fits much better than the previous one.

Preparing the equation for predicted JN is slightly more complicated than the previous example. In this case, the JPP must be calculated, and then multiplied by POP to obtain total JN sales. The regression model for the ith JPP observation can be written as

and then the total jean sales can be calculated as

A plot of the actual JN sales data and the predicted values based on the model shows a fairly good fit (Figure 6.44).

Figure 6.44 Plot of ( JPP = SPP I83 ) Model Results

Many variations of this model should be tried, using the interactive and iterative capabilities provided by tools such as SAS/Insight, to determine whether the good fit is significant, or possibly a coincidence. For example, a similar model, with a high R-Square value, showed that declining suit sales were correlated to increasing population -- remember that a correlation does not necessarily indicate a causal relationship.

6.5 Evaluating Methods for Accessing TABIS Using Various Network Techniques

In addition to performing analyses with TABIS, an important aspect of the research was designing a system which others could use to perform their own analyses. To accomplish that objective, TABIS must be accessible from numerous computer platforms. This section shows how the objective has been met by describing several ways which TABIS has been successfully accessed.

6.5.1 Direct Access From Unix Workstations

The fastest, most convenient, and fully-functional access to TABIS is available from any of over 1,300 DEC, HP, and Sun Unix workstations at NCSU which have AFS access to the TABIS disk and SAS software.

Specifically, TABIS was tested and found to be fully functional on both the "College of Textiles" (COT) and the "EOS" DEC and the Sun workstations which are prevalent in the student computing lab at the COT. The COT computers are open to textile students only, and the EOS computers are for students enrolled in engineering courses. TABIS also performed with the same functionality on the DEC and Sun workstations used in the COT graduate labs and the faculty desktops, which are set up slightly different from the student computers.

TABIS was also tested, and found to be fully functional on the HP and Sun workstations in the "Unity" labs such as Avent Ferry Complex and Daniels Hall. The Unity computers are open to all registered NCSU students and faculty.

TABIS was also successfully run on the Sun workstations in the "Statistics Instructional Computing Lab" ("SICL") These computers are used by students in the statistics department, and students enrolled in statistics courses.

The Sun workstations used by the graduate students and faculty in the statistics department were the only Unix computers tried at NCSU that were not capable of running TABIS directly. The reason being, at this time, these computers can not access files stored using Andrew's File System (AFS). As the statistics Unix workstations become more integrated into the NCSU network, and support AFS, TABIS should run on them with full functionality also.

6.5.2 Remote Access Using Telnet and X Windows

This is another, slightly slower, access technique which can be used to get full access to all of TABIS's X Windows graphical functionality on a computer which has X Windows capabilities but that can not access the AFS files directly (such as the Unix workstations in the statistics department). To use this method, the user runs "xhost +" to allow another computer to display X graphics on their screen. They then remotely log into an account on a computer that can access the TABIS AFS files, run TABIS there, and display the results on their computer by setting the DISPLAY environment variable appropriately. The following figure shows example commands needed to run TABIS on the COT server "hamby" and display the results on the statistics computer "eslaba" (Figure 6.45).

In addition to displaying TABIS across campus, this technique could by used from anywhere on the Internet that allows X Windows packets to be passed through their network and firewall.

6.5.3 Remote Access Using Telnet and ASCII

This access method was primarily designed to provide basic TABIS functionality for users that do have remote ASCII access to TABIS-capable computers (via login through a Telnet connection), but that do not have X Windows graphics capabilities. The basic functionality includes running the TABIS menus and producing ASCII tables and ASCII-art plots of the data (see Section 5.2).

Although most students use the Unix workstations in the computing labs, many of the faculty have a desktop PC or MacIntosh computer which is connected to the campus network, but can not access AFS files or run X Windows. They can use their network connection to remotely login to a Unix workstation and run TABIS, and view the ASCII output on their screen. This type of access was successfully tested from a PC in the Management Systems Lab in the COT, as well as a MacIntosh in the D.H. Hill Library.

This access method was also successfully verified from a computer attached to the Internet at SAS Institute. Their network does not allow X packets to be transferred, but does allow TABIS to be run, and the ASCII tables and plots be displayed.

6.5.4 Remote Dial-In Access

Many students and faculty choose to dial-in to the NCSU Unix workstations from home, using a variety of hardware and software combinations. TABIS's basic ASCII tables and plots were successfully tested via basic dial-in sessions from a Unix workstation, a PC, a MacIntosh, and even a VT220 "dumb terminal" directly connected to a modem.

Emerging computer technology is now allowing X Windows packets to be passed over a phone line, which enables users to view graphical TABIS output via a dial-in session (with the proper software). A dial-in session from an HP workstation running HP's point-to-point serial networking PPL program (which is HP's version of the Serial Line Internet Protocol - SLIP) and X Windows was used to test this functionality. Although the response time on graphics display was slow, the full TABIS X Windows graphics functionality was available.

6.5.5 World Wide Web (WWW) Access

A TABIS home page (see Figure 6.46) was set up to allow users from around the world to view samples of TABIS output and graphics, without needing an account on the NCSU computers - all of the other methods of accessing TABIS require the users to have an account. Viewing home pages on the Internet, or "surfing the Net" as it is called, has become very popular in 1995/96.

Any user with Internet access can view the TABIS home page using a World Wide Web (WWW) browser such as Mosaic or Netscape Navigator. The browser must be able to use hypertext transfer protocol (http), and the user's computer must be able to view images stored using the graphics interchange format (GIF). Access to the TABIS home page was tested from workstations across campus, workstations at SAS Institute in Cary, NC, and from a dial-in PPL session. Usage statistics indicate that the TABIS home page has been frequently accessed by computers from around the world.

The universal resource locator (URL) address for the TABIS home page is as follows:


Note that the original website is gone now, but I'm working on recreating it under the following new location:


Very recently, some of the data sources used to populate the TABIS data warehouse have made some of their data available in their own web pages. Hyper-text links have been established at the bottom of the TABIS web page to allow users to easily view these other web pages as they become available.

6.6 Application to Real World Information Needs

Perhaps one of the most important objectives of the TABIS research is to utilize TABIS to satisfy real world information needs (as opposed to answering purely hypothetical questions), both in academia and in industry. This section shows some examples of how TABIS has been used to satisfy real information needs.

6.6.1 Forecasting U.S. Apparel Demand to the Year 2010 by Age and Product Type

Forecasts of the U.S. apparel demand for the years 1991-2010 were prepared as part of a National Textile Center (NTC) project. This project was the main driving force which led to the creation of TABIS. The programmatical forecasting models developed for the NTC project are described in detail in Sections 6.4.1 and 6.4.2. Due to the proprietary nature of the data, the major emphasis of the research was on developing the programmatical forecasting infrastructure, rather than the actual forecasts.

Several papers, poster sessions, and reports were presented to share the results of this forecasting research. These include: a paper presented at the 1993 Academic Apparel Research Conference [AARC], a paper/poster display at the 75th International Conference of the Textile Institute [TI75] held at the 1994 Bobbin Show, poster displays at the 1993/94 National Textile Center (NTC) conferences, annual reports for the NTC [eg, NTC], a report of forecasts presented to the NPD, and reports prepared for the American Textile Manufacturing Institute (ATMI) in exchange for data from the past 10 years of the Textile Hilights, magazine.

6.6.2 Performing Customized Analyses for the American Apparel Manufacturers Association (AAMA)

Several of the previously described analyses were performed under the direction of the American Apparel Manufacturing Association (AAMA) to satisfy the information needs of U.S. apparel manufacturers. The employment maps and the various methods developed to visualize the population projections were of particular interest. In addition to those analyses, which were made accessible through the TABIS menus, two examples of customized analyses which were prepared for the AAMA are described below.

One project produced customized tables of the National Income and Product Accounts (NIPA) Personal Consumption Expenditures (PCE) data, which help identify changes in consumer patterns between 1959 and 1991. The tables calculate women's, girls' and infants' personal consumption expenditures on apparel as a percentage of PCE on nondurables, services, total PCE, and total apparel PCE. The same was done for men's and boys' apparel PCE, and the calculations were made in both current and constant dollars.

Another project, in progress by Ronald Funderburk, involves creating better forecasting models for men's suits. These models include more factors than just the previous suit consumption and population, and use more sophisticated statistical forecasting methods than the simple "mean" forecasting method used to develop the programmatical forecasting infrastructure.

6.6.3 Identifying Counties With a High Number of Apparel Manufacturing Establishments for National Textile Center (NTC) Research Project

In addition to the National Textile Center (NTC) project on forecasting apparel demand to 2010 [NTC], TABIS's functionality was also utilized in another NTC project. A joint project between several universities utilized TABIS maps of the County Business Pattern (CBP) data to identify counties with a high number of apparel manufacturing establishments. The maps and reports enabled the researchers to target the counties with a high number of apparel manufacturing establishments for further studies. A full page TABIS map showing the number of apparel manufacturing establishments per county was included in their annual report. Peyton Hudson, now retired, was the NCSU COT's representative on this research team.

6.6.4 Recruiting Textile Students by Developing a TABIS World Wide Web (WWW) Page

When World Wide Web (WWW) "surfing" was first becoming popular in early 1994, the NCSU College of Textiles (COT) decided to utilize it as a tool to recruit prospective new students from around the world.

Since the TABIS research includes both graphics and computer science - two topics which are very popular on the Web - the TABIS team was invited to prepare a Web page under the NCSU College of Textiles (COT) home page. The resulting TABIS Web page (see Section 6.5.5) was completed in February 1994, and was one of the first Web pages made available under the COT home page. The TABIS Web page is one of the most graphically impressive Web pages at the COT, and is one of the elite few Web pages anywhere which includes graphical animations.

As a result of the excellent TABIS Web pages under the COT home page, the COT has received several inquiries from prospective textile students from around the world asking for information on the programs offered.

6.6.5 Using TABIS as a Teaching Tool in Textile Classes

TABIS has been used by several classes at the NCSU College of Textiles, providing the students with high-quality data and graphics for their reports. The arrangement has been mutually beneficial, as the students provide excellent feedback which has helped lead to improvements in TABIS.

TABIS was used as an exercise in the 1993 and 1994 "Introduction to Textiles" (T105) class which all freshman textile students are required to take. Students learned how to run TABIS, and each student prepared a customized map of the textile employment, by county, in their home state.

The exercise proved to be beneficial to both the students and the TABIS project itself. The students learned how to use a tool which can provide valuable data and graphics for the projects and papers they might work on in other classes throughout their four years in the College of Textiles. Also, having several hundred students who were novice computer users try out TABIS provided very valuable feedback which prompted many improvements in TABIS.

It was apparent from the user feedback that the menus needed to be more user-friendly, and have default-values. A major re-programming effort was undertaken to add default-value functionality to each of the user prompts in each of the over 100 C programs which comprised the TABIS interface. The default values improved the user-friendliness, and decreased the likelihood of user-error.

Also, faster access to the large County Business Patterns (CBP) data sets, and faster printing of the U.S. maps by county on the 2Mb printers in the student labs, were needed. The CBP data set was indexed, and the query rewritten in a more efficient manner, to solve the access speed problem (see Section 4.4.2). The printing problem became apparent as a class of 35 students all tried to print their maps during the same one-hour class period - the dot map technique described in Section 6.2.3 was developed to speed up printing dramatically, and thereby eliminate the bottleneck at the printer.

Another class which used TABIS was the "Special Topics in International Trade and Apparel" (TAM 598C). This is a special graduate course taught by Carl Priestland (Chief Economist of the AAMA), and has a strong emphasis on data related to apparel manufacturing and sales. TABIS can be used to analyze and display much of the data discussed in the class, and special TABIS training sessions are presented to the members of this class, when needed.

6.6.6 Using TABIS in Graduate Research Projects

In addition to the previous example analyses, TABIS has also been utilized to satisfy the information needs of several graduate students performing research at North Carolina State University (NCSU).

For example, one graduate student performing research jointly for Statistics and Textiles, has become an integral part of the TABIS research. He is using TABIS data to develop statistical forecasting models of consumer apparel demand, and is also merging in his own data. TABIS logs indicate he ran TABIS over 300 times between 20dec1994 and 16nov1995. Since he often uses Sun workstations in the Statistics department which can not access AFS files, and also dials in from home, he has helped test numerous remote access methods, such as X Windows access across the network and ASCII dial-in capabilities.

Another graduate student, from the Sociology department, has also used TABIS in her research on textile and apparel employment. TABIS log files indicate she ran TABIS over 200 times between 06oct1994 and 20oct1995. Feedback from this user has led to several enhancements in TABIS functionality. TABIS was also ported to the HP workstations on the NCSU Unity network, since those were the only computers she could access.

< Previous |TOC | Next >