Data structures
Despite dozens of formats, geospatial data have common traits. Understanding these traits can help you approach and understand unfamiliar data formats by identifying the ingredients common to nearly all spatial data. The structure of a given data format is usually driven by its intended use. Some data is optimized for efficient storage or compression; some is optimized for efficient access; some is designed to be lightweight and readable (web formats); while other formats seek to contain as many different data types as possible.
Interestingly, some of the most popular formats today are also some of the simplest and even lack features found in more capable and sophisticated formats. Ease of use is extremely important to geospatial analysts, because so much time is spent integrating data into geographic information systems as well as exchanging data among analysts. Simple data formats facilitate these activities the best.
Common traits
Geospatial analysis is the application of information processing techniques to data with a geographic context. This definition contains the most important elements of geospatial data: geo-location data and subject information. These two factors are present in every format that can be considered geospatial data. Another common feature of geospatial data is spatial indexing, and related to indexing are overview data sets.
Geo-location
Geo-location information can be as simple as a single point on the Earth referencing where a photo was taken. It can also be as complex as a satellite camera's engineering model and orbital mechanics information, used to reconstruct the exact conditions and location under which the satellite captured an image.
Subject information
Subject information can also cover a wide range of possibilities. Sometimes the pixels of an image are the data themselves, providing a visual representation of the ground. Other times an image may be processed using multispectral bands, such as infrared light, to provide information that is not visible in the image. Processed images are often classified using a structured color palette linked to a key that describes what each color represents. Other possibilities include some form of database with rows and columns of information for each geo-located feature.
Spatial indexing
Geospatial data sets are often very large files easily reaching hundreds of megabytes or even several gigabytes in size. Geospatial software can be quite slow trying to repeatedly access large files when performing analysis. As discussed briefly in Chapter 1, Learning Geospatial Analysis with Python, spatial indexing creates a guide, which allows software to quickly locate query results, without examining every single feature in the data set. Spatial indexes allow software to eliminate possibilities and perform more detailed searches or comparisons on a much smaller subset of the data.
Indexing algorithms
Many spatial indexing algorithms are derivatives of well-established algorithms used for decades on non-spatial information. The two most common spatial indexing algorithms are the Quad-Tree index and the R-Tree index.
Quad-Tree index
The Quad-Tree algorithm actually represents a family of different algorithms based on a common theme. Each node in a Quad-Tree index contains four children. These child nodes are typically square or rectangular in shape. When a node contains a specified number of features, it splits if additional features are added. The concept of dividing a space into nested squares speeds up spatial searches: the software must handle only five points at a time and uses simple greater-than/less-than comparisons to check whether a point is inside a node. Quad-Tree indexes are most commonly found as file-based index formats.
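To make the splitting and searching behavior concrete, the following is a minimal, illustrative Quad-Tree sketch for point data. The class name, the max_items parameter, and the sample coordinates are invented for this example and are not taken from any particular library:

# A minimal quad-tree sketch for points; names are illustrative only.
class QuadTree(object):
    def __init__(self, xmin, ymin, xmax, ymax, max_items=4):
        self.bounds = (xmin, ymin, xmax, ymax)
        self.max_items = max_items
        self.points = []
        self.children = []   # empty until the node splits into four quadrants

    def insert(self, x, y):
        if self.children:
            self._child_for(x, y).insert(x, y)
        else:
            self.points.append((x, y))
            if len(self.points) > self.max_items:
                self._split()

    def query(self, qxmin, qymin, qxmax, qymax):
        # Skip this node entirely if the query box misses its bounding box
        xmin, ymin, xmax, ymax = self.bounds
        if qxmax < xmin or qxmin > xmax or qymax < ymin or qymin > ymax:
            return []
        hits = [(x, y) for (x, y) in self.points
                if qxmin <= x <= qxmax and qymin <= y <= qymax]
        for child in self.children:
            hits.extend(child.query(qxmin, qymin, qxmax, qymax))
        return hits

    def _split(self):
        # Divide this node into four quadrants and push its points down
        xmin, ymin, xmax, ymax = self.bounds
        xmid, ymid = (xmin + xmax) / 2.0, (ymin + ymax) / 2.0
        self.children = [QuadTree(xmin, ymin, xmid, ymid, self.max_items),
                         QuadTree(xmid, ymin, xmax, ymid, self.max_items),
                         QuadTree(xmin, ymid, xmid, ymax, self.max_items),
                         QuadTree(xmid, ymid, xmax, ymax, self.max_items)]
        for x, y in self.points:
            self._child_for(x, y).insert(x, y)
        self.points = []

    def _child_for(self, x, y):
        xmin, ymin, xmax, ymax = self.bounds
        xmid, ymid = (xmin + xmax) / 2.0, (ymin + ymax) / 2.0
        return self.children[(1 if x > xmid else 0) + (2 if y > ymid else 0)]

tree = QuadTree(-180, -90, 180, 90)
for lon, lat in [(-89.7, 30.2), (-89.3, 30.6), (2.35, 48.85),
                 (151.2, -33.9), (-0.1, 51.5)]:
    tree.insert(lon, lat)
print(tree.query(-95, 25, -85, 35))   # only the two Gulf Coast points

The query method illustrates the key idea: whole quadrants are discarded with a single bounding-box comparison before any individual points are examined.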
The following image shows a point data set sorted by a Quad-Tree algorithm. The black points are the actual data set, while the boxes are the bounding boxes of the index. Notice that none of the bounding boxes overlap. The left image shows the spatial representation of the index. The right image shows the hierarchical relationship of that same index, which is how spatial software sees the index and data. This structure allows a spatial search algorithm to quickly eliminate possibilities when trying to locate one or more points in relation to some other set of features:
R-Tree index
R-Tree indexes are more sophisticated than Quad-Trees. R-Trees are designed to handle multidimensional data, including three-dimensional data, and are optimized to store the index in a way that is compatible with how databases use disk space and memory. Nearby objects are grouped together using any of a variety of spatial algorithms, and all objects in a group are bounded by a minimum rectangle. These rectangles are aggregated into hierarchical nodes that are balanced at each level. Unlike a Quad-Tree, the bounding boxes of an R-Tree may overlap across nodes. Because of their relative complexity and database-oriented structure, R-Trees are most commonly found in spatial databases rather than file-based formats.
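If you want to experiment with R-Trees from Python, the third-party rtree package, a wrapper around the libspatialindex C library, provides one. The following sketch assumes that package is installed (it is not part of the standard library) and uses made-up bounding boxes:

# Requires the third-party rtree package (pip install rtree)
from rtree import index

idx = index.Index()
# insert(id, (minx, miny, maxx, maxy)) stores one bounding rectangle per feature
idx.insert(1, (-89.7, 30.1, -89.3, 30.6))
idx.insert(2, (-90.2, 29.8, -89.9, 30.0))
idx.insert(3, (2.2, 48.8, 2.5, 48.9))

# Only features whose rectangles intersect the query box are returned (ids 1 and 2)
print(list(idx.intersection((-91.0, 29.0, -89.0, 31.0))))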
The following diagram, from https://en.wikipedia.org/wiki/File:R-tree.svg, shows a balanced R-Tree for a two-dimensional point data set:
Grids
Spatial indexes also often employ the concept of an integer grid. Geospatial coordinates are usually floating point decimal numbers with anywhere from 2 to 16 decimal places. Performing comparisons on floating point numbers is far more computationally expensive than working with integers. Indexed searching is about eliminating possibilities first, and that step doesn't require full precision. Most spatial indexing algorithms therefore map floating point coordinates to a fixed-size integer grid. When searching for a particular feature, the software can use efficient integer comparisons rather than working with floating point numbers. Once the results are narrowed down, the software can access the full resolution data.
Grid sizes can be as small as 256 by 256 for simple file formats or can be as large as 3 million by 3 million in large geospatial databases designed to incorporate every known coordinate system and possible resolution. The integer mapping technique is very similar to the rendering technique used to plot data on a graphics canvas in mapping programs. The SimpleGIS script in Chapter 1, Learning Geospatial Analysis with Python, also uses this technique to render points and polygons using the built-in Python Turtle graphics engine.
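As a rough sketch of the integer grid idea, the following hypothetical helper maps longitude and latitude values onto a 256 by 256 grid; the function name and grid size are only illustrative:

GRID_SIZE = 256

def to_grid(lon, lat):
    # Scale world coordinates (-180..180, -90..90) to integer grid cells
    col = int((lon + 180.0) / 360.0 * (GRID_SIZE - 1))
    row = int((lat + 90.0) / 180.0 * (GRID_SIZE - 1))
    return col, row

print(to_grid(-89.69, 30.17))   # cheap integer cell used for coarse comparisons
print(to_grid(-89.32, 30.64))   # nearby points land in the same or a neighboring cell

An index only needs these coarse cells to rule features in or out; the full-precision coordinates are read afterwards.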
Overviews
Overview data is most commonly found in raster formats. Overviews are resampled, lower-resolution versions of raster data sets that provide thumbnail views or simply faster-loading image views at different map scales. They are also known as "pyramids", and the process of creating them is known as "pyramiding" an image. These overviews are usually preprocessed and stored with the full-resolution data, either embedded in the file or in a separate file. The trade-off for this convenience is that the additional images add to the overall file size of the data set; however, they speed up image viewers. Vector data also has a concept of overviews, usually to give a data set geographic context in an overview map. However, because vector data is scalable, reduced-size overviews are usually created on the fly by software using a generalization operation, as mentioned in Chapter 1, Learning Geospatial Analysis with Python. Occasionally, vector data is rasterized by converting it to a thumbnail image, which is stored with the data set or embedded in its file header. The following image demonstrates the concept of image overviews and also shows visually why they are often called pyramids:
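If you have the GDAL Python bindings installed, overviews can also be built from Python. The following sketch is only an illustration: it assumes an existing raster named example.tif, a placeholder filename, and uses GDAL's BuildOverviews method:

# Assumes the GDAL Python bindings are installed and example.tif exists
from osgeo import gdal

ds = gdal.Open("example.tif", gdal.GA_Update)
# Each factor produces one resampled overview level at 1/2, 1/4, and 1/8 resolution
ds.BuildOverviews("AVERAGE", [2, 4, 8])
ds = None   # closing the dataset flushes the overviews to disk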
Metadata
Metadata is any data that describes the associated data set. Common examples of metadata include basic elements such as the footprint of the data set on the Earth, as well as more detailed information such as the spatial projection and information describing how the data set was created. Most data formats store at least the footprint or bounding box within the data format itself. Detailed metadata is typically stored in a separate location in a standard format such as FGDC or ISO, or follows the newer European Union initiative that includes metadata requirements, called the Infrastructure for Spatial Information in the European Community (INSPIRE).
File structure
The preceding elements can be stored in a variety of ways within a single file, multiple files, or in a database depending on the format. The following table shows the most frequently used storage formats for the common geospatial data elements explained previously. The elements in this table may be found in different combinations for different types of file formats.
Human readable formats such as XML files, spreadsheets, and structured text files require only a text editor to investigate. These files are also easily parsed and processed using Python's built-in modules, data types, and string manipulation functions. Binary-based formats are more complicated. It is typically easier to use a third-party library to deal with binary formats.
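As a small illustration of how approachable the human-readable formats are, the following sketch pulls a point coordinate out of a made-up KML-style fragment using only the built-in xml.etree.ElementTree module; the fragment and its values are invented for this example:

import xml.etree.ElementTree as ET

xml_data = """
<Placemark>
  <name>Sample point</name>
  <Point><coordinates>-89.69,30.17,0</coordinates></Point>
</Placemark>
"""

root = ET.fromstring(xml_data)
coords = root.find("Point/coordinates").text
lon, lat = [float(v) for v in coords.split(",")[:2]]
print("%s, %s" % (lon, lat))   # -89.69, 30.17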
Even for binary formats, however, you don't have to use a third-party library, especially if you just want to investigate the data at a high level. Python's built-in struct module has everything you need: it lets you read and write binary data as strings. When using the struct module, you need to be aware of the concept of byte order, which refers to how the bytes of information that make up a file are stored in memory. This order is usually platform specific, but in some rare cases, including shapefiles, the byte order is mixed within a single file. The Python struct module uses the greater-than symbol (>) to specify big-endian byte order and the less-than symbol (<) to specify little-endian byte order.
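As a quick illustration of the two prefixes, the following snippet packs the same arbitrary double both ways and shows that unpacking with the wrong prefix produces a meaningless value:

import struct

little = struct.pack("<d", -89.69)   # little-endian bytes
big = struct.pack(">d", -89.69)      # the same value, big-endian bytes
print(struct.unpack("<d", little))   # (-89.69,)
print(struct.unpack(">d", big))      # (-89.69,)
print(struct.unpack(">d", little))   # nonsense: bytes read in the wrong order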
The following brief example demonstrates using the Python struct module to parse the bounding box coordinates from an Esri shapefile vector data set. You can download this shapefile as a zipped file at the following URL:
https://geospatialpython.googlecode.com/files/hancock.zip
When you unzip this file you will see three files. For this example we'll be using hancock.shp. The Esri shapefile format stores the minimum x, minimum y, maximum x, and maximum y bounding box values at a fixed location, with a fixed data type, in the file header from byte 36 to byte 67. This example will execute the following steps:
- Import the struct module
- Open the hancock.shp shapefile in binary read mode
- Navigate to byte 36
- Read each 8-byte double, specified as d, and unpack it using the struct module in little-endian order, as designated by the < sign
The best way to execute this script is in the interactive Python interpreter. We will read the minimum longitude, minimum latitude, maximum longitude, and maximum latitude:
>>> import struct
>>> f = open("hancock.shp","rb")
>>> f.seek(36)
>>> struct.unpack("<d", f.read(8))
(-89.6904544701547,)
>>> struct.unpack("<d", f.read(8))
(30.173943486533133,)
>>> struct.unpack("<d", f.read(8))
(-89.32227546981174,)
>>> struct.unpack("<d", f.read(8))
(30.6483914869749,)
You'll notice that when the struct module unpacks a value, it returns a Python tuple with one value. You can shorten the preceding unpacking code to one line by specifying all four doubles at once and increasing the byte length to 32 bytes, as shown in the following code:
>>> f.seek(36)
>>> struct.unpack("<dddd", f.read(32))
(-89.6904544701547, 30.173943486533133, -89.32227546981174, 30.6483914869749)
If you are examining a lot of files or are dealing with an undocumented file format, using the seek method and the struct module can become tedious and repetitive. The next script, called fmtDecode.py, attempts to make examining files a little easier. The script gives you a listing of files in the current directory and asks you to choose one, so it must be in the same directory as the shapefile. It then proceeds to read the file a few bytes at a time, attempting to interpret them as every known data type in both little-endian and big-endian byte order, and presents you with a list of choices to pick from. Typically, the correct data type stands out from the other, incorrect choices because it is a simple number or character that makes the other choices obviously wrong. For example, two of the choices might be:
Little double: (-89.6904544701547,)
or:
Big double: (2.1220012415e-314,)
The first choice looks like it might be a geospatial coordinate, while the second clearly looks erroneous. You would type 1 and the script would move forward.
As you select the best candidate field from each part of the file, the script tracks your results. Once you make a choice, you are given the option to add a note describing what purpose the field may serve in the file format. If you get to the next field and decide you made a mistake, you can go back one field. At any time you can type exit to leave the program. When you exit, the results are saved to a text file in the same directory; this text file forms a simple file specification. If you interrupt the program, usually by pressing CONTROL-C on most platforms, you can jump back multiple records; in that case the program caches your location and then exits. Whenever you exit the program, for any reason, your location in the current file is cached. When you run the script again and choose the same file, you are given the option to start from the cached location, and the results are appended to the results file.
This is a very simple brute-force script, but it does simplify the process of stepping through an unknown data format. It has been a key tool in reverse engineering several undocumented geospatial file formats. And like any script, you can easily modify it to work better with a particular file format of interest, as shown in the following code:
import struct
import pickle
import os

def export():
    # Write the collected field descriptions to a tab-delimited text file
    # and save the list of visited byte offsets to the pickle cache.
    print "Saving results"
    out = None
    if cached:
        out = file(oname, "a")
    else:
        out = file(oname, "w")
        out.write(header)
    for record in fileDesc:
        for field in record:
            out.write("%s\t" % field)
        out.write("\n")
    out.close()
    pickle.dump(cached, file(pickleJar, "w"))

header = "POSITION\tFIELD\tSAMPLE\tTYPE\tBYTE_ORDER\n"
fileDesc = []

# List the files in the current directory and let the user pick one
files = os.listdir(".")
count = 1
print "Available Files:"
for f in files:
    print " %s. %s" % (count, f)
    count += 1
fnum = raw_input("Enter the number of the file to decode: ")
fname = files[int(fnum) - 1]
base = os.path.splitext(fname)[0]
pickleJar = "%s.p" % base

# Reload a cached session for this file if one exists
cached = []
if os.path.exists(pickleJar):
    print "Cached session available."
    print
    useCache = raw_input("Use it? Yes (Y), No (N)?")
    if "y" in useCache.lower() or useCache == "":
        cached = pickle.load(open(pickleJar, "r"))
    else:
        cached = []

oname = "%s_decode.txt" % base
f = open(fname, "rb")
f.seek(0, 2)
eof = f.tell()
f.seek(0)
prev = 0
if len(cached) > 0:
    print "Using cache..."
    f.seek(cached[-1])
    prev = cached[-2]
print "Starting at byte %s..." % f.tell()

try:
    # Candidate C data types with their struct format codes and byte lengths
    formats = {"char": {"format": "c", "len": 1},
               "signed char": {"format": "b", "len": 1},
               "unsigned char": {"format": "B", "len": 1},
               "_Bool": {"format": "?", "len": 1},
               "short": {"format": "h", "len": 2},
               "unsigned short": {"format": "H", "len": 2},
               "int": {"format": "i", "len": 4},
               "unsigned int": {"format": "I", "len": 4},
               "long": {"format": "l", "len": 4},
               "unsigned long": {"format": "L", "len": 4},
               "long long": {"format": "q", "len": 8},
               "unsigned long long": {"format": "Q", "len": 8},
               "float": {"format": "f", "len": 4},
               "double": {"format": "d", "len": 8}}
    while f.tell() < eof:
        record = []
        start = f.tell()
        record.append("%s\t" % start)
        cached.append(start)
        fields = []
        print
        count = 1
        try:
            # Little-endian interpretations of the next few bytes
            for fmt in formats:
                form = formats[fmt]["format"]
                bytes = formats[fmt]["len"]
                field = struct.unpack("<%s" % form, f.read(bytes))
                print "%s. Little %s: %s" % (count, fmt, field)
                count += 1
                f.seek(start)
                fields.append([str(field[0]), fmt, "little", str(bytes)])
        except:
            pass
        try:
            # Big-endian interpretations of the same bytes
            for fmt in formats:
                form = formats[fmt]["format"]
                bytes = formats[fmt]["len"]
                field = struct.unpack(">%s" % form, f.read(bytes))
                print "%s. Big %s: %s" % (count, fmt, field)
                count += 1
                f.seek(start)
                fields.append([str(field[0]), fmt, "big", str(bytes)])
        except:
            pass
        print "%s. Go back to previous" % count
        print
        print "Current location: %s" % f.tell()
        choice = raw_input("Enter the number of one of the above options: ")
        choice = int(choice.strip())
        desc = ""
        if choice != count:
            # Record the chosen interpretation and advance past this field
            desc = raw_input("Enter a field description: ")
            record.append("%s\t" % desc)
            record.append("%s\t" % fields[choice - 1][0])
            record.append("%s\t" % fields[choice - 1][1])
            record.append("%s\t" % fields[choice - 1][2])
            f.seek(start + int(fields[choice - 1][3]))
            prev = start
            fileDesc.append(record)
        elif choice == count:
            f.seek(prev)
            print "Going back to previous field."
    f.close()
    export()
except KeyboardInterrupt:
    # CONTROL-C lets the user rewind the cache before saving and exiting
    print
    reverse = int(raw_input("How many records back? "))
    for i in range(reverse):
        cached.pop()
    pickle.dump(cached, file(pickleJar, "w"))
    print "The program will exit. Restart and use cached version."
except:
    # Any other input (such as typing "exit") saves the results and quits
    export()
Using this script against the 100-byte file header of the hancock.shp file from the first example, we get the following output:
There are other tools for reverse engineering file specifications, but the goal of this book is to show you that, in most cases, Python is the only tool you need. Using Python as much as possible will also increase your ability with the language, making it an even more useful tool. There are also software libraries for most data formats, but the ability to understand your data at the byte level will make you a better and more capable analyst.