Section Two: Primary and Secondary Data, and Data Quality and Error

In science, all data can be broken into two categories - primary data and secondary data - and each science independently defines how different kinds of data fit into each category.  Within GIS, primary data is any data digitized within the software or collected by hand using a GPS receiver, as we examined in Chapter Six.  Secondary data is any data which is output from a geoprocessing tool or data that has been edited without referencing aerial imagery.  Whether data is primary or secondary can affect its quality based on the creation method.  Since primary data is created by directly looking at an image on screen or by interacting with the real world, the potential for high quality data is much greater, as long as the metadata or file name includes key information, for example the map scales the data is suitable for or the stored coordinate system.  Geoprocessing tools, on the other hand, produce data using established algorithms for how the input data interacts with other data or map measurements.  With secondary data, there is a chance of artifacts or false data. For example, ArcGIS has a raster tool which can estimate where streams might be using a digital elevation model (DEM) based on how the elevation changes over a series of pixels.  This could be a useful tool for determining where intermittent streams might be present following flash floods, but could create false data in regions where rain is a constant presence and the flow channels are well-established.
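As a concrete illustration, the stream-estimation workflow described above can be scripted with the Spatial Analyst tools. The sketch below is a minimal example, not a recommended workflow: the workspace path, DEM name, output name, and the 1,000-cell flow-accumulation threshold are all hypothetical placeholders, and the resulting layer is secondary data that inherits whatever error is present in the DEM.

```python
# Minimal sketch: deriving estimated stream channels from a DEM with Spatial Analyst.
# The workspace, raster names, and 1,000-cell threshold are hypothetical placeholders.
import arcpy
from arcpy.sa import Fill, FlowDirection, FlowAccumulation, Con

arcpy.CheckOutExtension("Spatial")            # Spatial Analyst license required
arcpy.env.workspace = r"C:\GIS\Project.gdb"   # hypothetical workspace

dem = arcpy.Raster("elevation_dem")    # the input DEM
filled = Fill(dem)                     # remove sinks so water can flow off the raster
flow_dir = FlowDirection(filled)       # direction of steepest descent for each cell
flow_acc = FlowAccumulation(flow_dir)  # count of upstream cells draining to each cell
streams = Con(flow_acc > 1000, 1)      # keep only cells draining more than 1,000 cells
streams.save("estimated_streams")      # secondary data: an estimate, not observed streams
```

Every cell kept by the threshold is an algorithmic guess about where water would collect, which is exactly why this output needs to be checked against imagery or field knowledge before it is trusted.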

8.2.1: Errors of Omission and Errors of Commission

Errors in GIS data can be divided into two main categories: errors of omission and errors of commission.  An error of omission can be defined as leaving out features or attributes from a data set, while an error of commission can be defined as creating false or erroneous features or attributes, regardless of intent. For example, a layer where a street is digitized to show it incorrectly passing through a field may be the result of an error of commission, while a world countries layer without Ireland may be the result of an error of omission.  The terms 'error of omission' and 'error of commission' do not refer to the intent of the technician: they may have accidentally left out Ireland, or it may have been left out on purpose, whether maliciously or not.  Malicious intent is actually quite rare; it is more often the case that someone simply didn't need certain features for the project they were working on and committed errors of omission based on project need.  Errors of commission, while they might seem a more likely candidate for malicious intent, may also be caused by project need, a representation of what was or what might be, or the use of attributes which reflect local knowledge rather than recorded knowledge (calling a mountain by the local name of "Devil's Hill" instead of the officially recorded name of "Anderson Peak").  Either way, malicious intent is far rarer than errors driven by specific project needs.

8.2.2: Cascading Error

Cascading error is not a category of error like errors of omission and errors of commission are, but is instead the result of an error within the model of GIS.  When an error, intentional or unintentional, is made early in the process, that error is carried through the entire project, and thus none of the data can be trusted.  If a technician is deriving output data from erroneous input data, the product, too, is erroneous - possibly more so than the input data, depending on the tool and how much it relies on assumptions.  If the technician is unaware of the mistake, the entire project is a lie and could have some rather substantial consequences if claims or conclusions are made based on the resulting data which show incorrect boundaries, totals, or interactions.

The only way to repair a cascading error is to work methodically backwards, finding the erroneous data, then moving through the project process a second time with the correct data.  It's possible that the very first layer added to a project had errors or was of a quality unfit for the project at hand, meaning the technician will need to complete the entire project again.  In fact, as we will see in the sections ahead, it's possible to make errors before any data is created and analyzed.  Any place in the model of GIS - reality, conception, representation, analysis, documentation, distribution, and storage - can be a source of error.

While error can never be completely avoided (as we know that some generalizations of the world must be made in order to create GIS data in the first place), the overall error in the data can be minimized by assessing four basic criteria - accuracy and precision, completeness, currency, and consistency - and through proper and complete documentation.

8.2.3: Accuracy and Precision

Accuracy vs Precision: A Review

Accuracy is defined as the ability to hit your target.

Precision is defined as the ability to repeat your results almost exactly the same every time.

Error, in GIS and Cartography, is defined as the linear distance between the represented data and reality.

Results are considered both accurate and precise when the results are “on target”, every time. This concept is used in many scientific disciplines, but in GIS and Cartography, we use accurate and precise to define when our analysis and display are as close to the real world as possible, with as little error as possible.

[Figure: precision and accuracy]

Feature and Attribute Accuracy

Accuracy, in terms of GIS data, refers to how close our representation of reality is to the accepted or real-world values. Accurate features exist in the GIS in the most correct position, known as positional accuracy.  When a vector feature lines up extremely well with the features as seen in imagery, or has vertices whose coordinates match measured GPS coordinates (assuming those coordinates have been collected several times to increase the sample size, which in turn increases the accuracy - a high density of observations), the features can be considered to have a very high positional accuracy.  Positional accuracy is independent from feature precision, meaning data which is accurate is not inherently precise, nor is precise data inherently accurate.  Both Figure 8.1a and 8.1d, below, have accurately digitized features, independent of the feature precision.
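A quick way to see how a higher density of observations and the definition of error above fit together is to average several GPS fixes for the same point and then measure the straight-line distance to the digitized vertex. The sketch below uses made-up projected coordinates in feet; the numbers are purely illustrative.

```python
import math

# Hypothetical repeated GPS fixes for one survey point, in projected feet (made-up values).
gps_fixes = [(1000.2, 2000.1), (999.8, 1999.7), (1000.1, 2000.4), (999.9, 1999.8)]

# Averaging the fixes (a higher density of observations) gives a better estimate of "reality".
avg_x = sum(x for x, y in gps_fixes) / len(gps_fixes)
avg_y = sum(y for x, y in gps_fixes) / len(gps_fixes)

# The digitized vertex as it exists in the GIS layer (also made up).
digitized = (1003.5, 1998.0)

# Error, as defined above: the linear distance between the representation and reality.
error_ft = math.dist(digitized, (avg_x, avg_y))
print(f"Positional error: {error_ft:.2f} feet")
```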

In cartography and GIS, there are standard distances, based on map scale and set by the USGS, that define the accepted positional accuracy for creating features and the percentage of features on the resulting map that must fall within these tolerances to be considered accurate. In other words, standards have been set to say that your data cannot be considered accurate if it does not meet the standards below.

Map Scale      Accepted Error Distance
1:1,200        ± 3.33 feet
1:2,400        ± 6.67 feet
1:4,800        ± 13.33 feet
1:10,000       ± 27.78 feet
1:12,000       ± 33.33 feet
1:24,000       ± 40.00 feet
1:63,360       ± 105.60 feet
1:100,000      ± 166.67 feet
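The distances in the table match the thresholds in the 1947 United States National Map Accuracy Standards, which allow roughly 1/30 inch of error at publication scale for maps larger than 1:20,000 and 1/50 inch for maps at 1:20,000 and smaller. The sketch below reproduces the table values from those two thresholds; it is offered as a check on the arithmetic, not as an official implementation.

```python
# Reproduce the accepted error distances from the map-accuracy rule of thumb:
# 1/30 inch at publication scale for scales larger than 1:20,000, 1/50 inch otherwise.
SCALES = [1200, 2400, 4800, 10000, 12000, 24000, 63360, 100000]

for denom in SCALES:
    tolerance_in = 1 / 30 if denom < 20000 else 1 / 50  # inches on the printed map
    ground_error_ft = denom * tolerance_in / 12          # convert map inches to ground feet
    print(f"1:{denom:,}  +/- {ground_error_ft:.2f} feet")
```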

It should also be noted that there is a problem with scale-based accuracy that is inherent to GIS and not present in printed maps. Vector data, by definition, can be scaled to any size without loss of quality. This quality, however, refers to the graphics - vector features do not become distorted with a change of scale - and not to the positional accuracy. As you change the scale of a GIS project using vector features, there may be uncorrected accuracy errors that come from the standards applied when the data was collected, not when the data is used at a different scale. This is not to say that changing the scale of vector features will completely ruin your project, but corrections through editing may need to be made before geoprocessing tools or other types of analysis can be run.

Attribute accuracy is independent of, but related to, positional accuracy. While positional accuracy is concerned with where a feature is positioned on a map, attribute accuracy deals with how close the values in the attribute table match the accepted and real-world values. Accuracy for a text attribute could be concerned with the spelling of a name, while accuracy for a numeric attribute could be concerned with a measurement. There is only one acceptable and accurate spelling of Colorado: “C - O - L - O - R - A - D - O”. The length of a road, however, may have three different values on record, making the value you choose to add to the table, or the value calculated by the GIS, less accurate. In order to increase the accuracy of your length value, you will need to measure the road a few times, another example of density of observations.

Feature and Attribute Precision

Precision, in GIS, refers to the perfection of digitized or collected features and the detail of the attributes. In the section above, we looked at the length of a road and how we can increase the accuracy of the value by measuring it more than once.  Attribute precision, in this case, would be how many decimal places you choose to measure that road to, or how closely you would follow it with a measuring wheel. Does your agency require roads to be measured down to five decimal places, or that you follow every slight bend exactly as it exists in the real world? These requirements define the level of precision in the values you will add to the attribute table.
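To make the distinction concrete, the short sketch below treats the three on-record road lengths as a small made-up example: averaging them addresses accuracy, while the number of decimal places reported addresses precision (the five-decimal requirement is hypothetical).

```python
# Three on-record lengths for the same road, in miles (made-up values).
recorded_lengths = [2.41378, 2.41402, 2.41391]

# Accuracy: more measurements (a higher density of observations) pull the stored value
# closer to the real-world length.
best_estimate = sum(recorded_lengths) / len(recorded_lengths)

# Precision: how much detail the agency requires in the stored attribute,
# e.g. a hypothetical five-decimal-place standard versus a two-decimal one.
print(f"Five decimal places: {best_estimate:.5f} miles")
print(f"Two decimal places:  {best_estimate:.2f} miles")
```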

When it comes to features, much like following the road exactly as it curves in real life, precision is measured by the detail present in the digitization process. Features are more precise when you take the time to place a vertex at every curve you see or to perfectly follow the shore of a lake or the sidewalk of a park. In Figure 8.1, below, we see that both 8.1a and 8.1c have precise features, with only 8.1a having features which are both precise and accurate.

Figure 8.1: Comparing Feature Accuracy and Precision
8.1a: Features are both precise and accurate, meaning the feature has been digitized based on the actual outline of the building and the digitized features line up perfectly with the building position as seen in the imagery.
8.1b: Features are neither accurate nor precise, meaning the digitized features are not lined up with the imagery, nor do the features follow the buildings exactly as they appear in the imagery.
8.1c: Features are precise but not accurate, meaning the outlines follow the buildings well as seen in the imagery but do not line up well with the building position in the image.
8.1d: Features are accurate but not precise, meaning the digitized features line up well with the building position in the imagery but do not follow the exact outline of the buildings.

Conceptual and Logical Accuracy and Precision

Two other kinds of accuracy and precision in GIS are conceptual and logical accuracy and precision. Conceptual error is introduced in the second part of the model, “conception”, in which a GIS technician doesn’t see, or isn’t capable of seeing, all the parts of the project at the beginning - a lack of accuracy and precision in the planning process. Conceptual error may be either an error of omission - the technician cannot foresee every problem which may arise - or an error of commission - the technician did a poor job planning for the future of the project.

Logical accuracy and precision are concerned with the “storage” and “analysis” portions of the model. For storage, we need to look back to Chapter Four, where we introduced the organizational data model used in this class by means of a folder structure.  This is, by design, a plan for logical accuracy and precision. By using the same folder structure and naming the folders identically from project to project, the storage structure is accurate and the locations of the folders are precise. Anyone can, at any length of time from now, look back into the folder structure and quickly locate the data they need, since all the files are named with memorable and meaningful names.
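One way to enforce this kind of logical accuracy and precision is to build the same folder skeleton with a short script at the start of every project, so nothing gets named or placed inconsistently. The folder names below are hypothetical stand-ins for whatever structure your class or organization has actually adopted.

```python
from pathlib import Path

# Hypothetical project skeleton; substitute the folder names your organization actually uses.
project_root = Path("C:/GIS/Projects/FloodStudy_2024")
subfolders = ["OriginalData", "WorkingData", "FinalData", "Maps", "Documentation"]

for name in subfolders:
    (project_root / name).mkdir(parents=True, exist_ok=True)  # identical names, every project
    print(f"Created (or found) {project_root / name}")
```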

The Cost of Accuracy and Precision

Creating accurate and precise data is, to a point, a fluid process.  Depending on the specific needs of your project, you may need to take a large amount of time to create highly accurate and precise data, or you might be able to get away with a quick and dirty shapefile. From this, we can begin to see that, for all intents and purposes, accuracy and precision come at a cost in time and money. To increase accuracy and precision, you must take more time (and if you’re on the clock, money) to carefully create the data you need, whether that is digitizing features on the computer, collecting data by means of GPS or survey, or investing in a higher-quality camera for remote sensing. In GIS we simply cannot have fast, cheap, accurate, and precise all present in the same data.

8.2.4: Feature and Attribute Completeness, Consistency, and Currency

Three factors which contribute to feature and attribute quality are completeness, consistency, and currency.  Each factor has an independent meaning, but they are often intertwined when it comes to data creation and analysis.

Completeness

In addition to accuracy and precision, quality data is complete, both in the features represented and the attributes associated with those features. For example, if you downloaded a layer of the United States, you’d expect it to have all fifty states and possibly Washington DC, Puerto Rico, and Guam. You’d also expect the attributes of that layer to have all the state names in the NAME field and the total areas in the AREA_MILE field. If the layer is missing Rhode Island because whoever digitized the layer thought that no one would notice or care since it’s so small, the data is not of the best quality it can be due to a missing state. When looking at errors of commission and errors of omission, this example would actually be both, since a feature is missing (error of omission) and the technician did it on purpose (error of commission).  Through the error of omission in this case, the data can also be described as incomplete.
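A simple completeness check like the one sketched below can catch the missing-Rhode-Island problem before any analysis begins. It assumes the geopandas library and a hypothetical shapefile path; the NAME and AREA_MILE fields come from the example above.

```python
import geopandas as gpd

# Hypothetical path to the downloaded states layer.
states = gpd.read_file("data/us_states.shp")

# Attribute completeness: are the expected fields present?
required_fields = {"NAME", "AREA_MILE"}
missing_fields = required_fields - set(states.columns)
print("Missing fields:", sorted(missing_fields) or "none")

# Feature completeness: are all fifty states (plus any expected territories) there?
print("Feature count:", len(states))
print("Rhode Island present:", (states["NAME"] == "Rhode Island").any())
```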

Consistency

Consistency of data quality is the overall uniformity of quality between features and attributes within a single dataset, or how the data is organized within a single geodatabase, feature dataset, or folder structure. In regard to feature consistency and uniformity between single features within a single feature class or shapefile, there is a generalized expectation that if features are digitized in an extremely accurate and precise way for, say, the states on the west coast of the United States, the states on the east coast would be digitized at the same level of accuracy and precision.  When data is supposed to be accurate for a map scale of 1:24,000, an end-user expects all the nooks and crannies of the non-straight lines of the states to be present at an equal accuracy throughout.  It wouldn't be expected that the technician in charge of digitizing those United States would get tired of the task and do a cruddy job toward the end, resulting in an inconsistent quality throughout the dataset.

Attribute consistency is concerned with values in an attribute table matching the field heading and existing in the proper place. Where completeness of data is concerned with all the necessary values being present, consistency of data is concerned with attributes being in the proper place. As you can see, when it comes to a quality attribute table, you really want both complete and consistent data.

Structural consistency, when it comes to feature classes in a geodatabase, means those feature classes being in the same projection (where applicable) and the feature classes within a single feature dataset being of the same type or category - for example, highways, streets, street signs, etc. all being in the “Transportation” feature dataset.  When it comes to folders and shapefiles, consistency means a logical folder structure (also “logical accuracy”, though several of the concepts of data quality overlap), memorable and meaningful file names, and the shapefiles and rasters existing in the same projection (where applicable).
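Structural consistency of this kind can also be checked with a few lines of code. The sketch below assumes geopandas and hypothetical shapefile names for a "Transportation" grouping, and simply confirms that every layer reports the same coordinate reference system.

```python
import geopandas as gpd

# Hypothetical layers that should all live in one "Transportation" grouping
# and share a single projection.
layer_paths = [
    "transportation/highways.shp",
    "transportation/streets.shp",
    "transportation/street_signs.shp",
]

crs_by_layer = {path: gpd.read_file(path).crs for path in layer_paths}
unique_crs = {str(crs) for crs in crs_by_layer.values()}

if len(unique_crs) == 1:
    print("Consistent: all layers share", unique_crs.pop())
else:
    print("Inconsistent projections found:")
    for path, crs in crs_by_layer.items():
        print(f"  {path}: {crs}")
```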

Currency

One major factor when deciding if a data set or layer is of the highest quality for your project is time. While we can clearly see that completeness and consistency are “yes” or “no” variables - “yes, this layer has all of the data I need to complete my project” or “no, it’s missing features or values” - currency as a measure of data quality is a little more fluid. Depending on your project needs, you might consider a data set which is one week old or one year old to be the best fit.

For example, if you are looking at the results of a major flood, only data that falls between the time when the flood occurred and when restoration began can be considered to be of the highest quality. Using a data set that was created just one day before the flood or a few months after repairs had been completed can be considered “poor quality” data. However, if your project is looking at the geology of the region, a data set that was created 10 years ago is likely of equal quality to one created last week, since a geologic time line is so much longer than a flood time line. In the case of the geology layer, quality is less defined by currency and more emphasis is put onto completeness and consistency.