Creating Structured Data Sets for Visualization


These are essential tips to guide you in collecting and generating clean ethnographic data sets that you will be able to analyze and visualize using a flexible variety of existing and future software tools and media formats.

Write your notes using a plain text editor such as TextEdit, or maintain a backup set of your fieldnotes in a plain text format (e.g. .TXT files). Keeping an archive of your notes in plain text will let you cleanly import your fieldnote data, or copy and paste selected notes, into specialized software that can curate them and perform accurate discourse analysis on them (e.g. NVivo). An archive of fieldnotes saved in plain text will also remain accessible to future generations of software and technologies. The safest way to create fieldnotes with basic text formatting, and to preserve them for a wide variety of uses in the long run, is to use plain text software from the moment you begin your research.

Explanation: Files from proprietary word processing software (e.g. Word) contain invisible formatting instructions that may later become mixed with your real data. The files themselves may not only be unreadable in other word processing tools, but may eventually become unreadable in some future version of the same software. Saving proprietary files as plain text files strips the proprietary formatting information (e.g. font styles, indentation, comments, markup) and makes them widely accessible.

Create or save structured data from spreadsheets in plain formats such as .TSV (tab-separated values) or .CSV (comma-separated values). The formatting issues described above for proprietary word processing software apply to structured and tabulated data, too. Documents created in Excel, for example, contain invisible formatting data, and the files themselves may well become inaccessible at a future time. Exporting and saving your data as .TSV or .CSV files will therefore ensure access to your data from a wider variety of existing and future visualization tools.
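The difference between the two formats can be seen in a minimal sketch, assuming Python and its standard csv module. The sample rows, column names, and the file name in the final comment are hypothetical and serve only as illustration.

```python
import csv
import io

# Hypothetical sample rows; the column names and values are illustrative only.
rows = [
    {"participant_id": "P01", "birthyear": "1984", "notes": "Lives in town, commutes"},
    {"participant_id": "P02", "birthyear": "1991", "notes": "Recently moved"},
]

def to_delimited(rows, delimiter):
    """Return the rows as delimited plain text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=list(rows[0].keys()),
        delimiter=delimiter,
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

csv_text = to_delimited(rows, ",")   # comma-separated values
tsv_text = to_delimited(rows, "\t")  # tab-separated values

print(csv_text)
print(tsv_text)

# To save the result, write the text with an explicit encoding, e.g.:
# with open("participants.csv", "w", encoding="utf-8", newline="") as f:
#     f.write(csv_text)
```

In the CSV output the note that contains a comma is wrapped in quotation marks so that it is not mistaken for two columns. In the TSV output no quoting is needed, because the separator is a tab.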

Use a consistent and cogent set of categories for descriptive data and a consistent set of units for measurable data.

Ensure that column names are the same for the same types of data if you are using multiple worksheets, workbooks or data sets. For example, if the columns “year of birth,” “birthday,” and “year” hold the same type of data, they should be defined as a single term such as “birthyear” in all your data sets. This consistency is especially important for joining and analyzing data sets that are derived from multiple sources or time periods.
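One way to apply this consistently is to keep a single lookup table of header variants. The following is a minimal sketch in Python; the mapping, the function name, and the two sample header lists are hypothetical.

```python
# Hypothetical mapping from the header variants found in different
# sources to one agreed-upon column name.
CANONICAL_NAMES = {
    "year of birth": "birthyear",
    "birthday": "birthyear",
    "year": "birthyear",
}

def harmonize_headers(headers):
    """Replace known header variants with their canonical name.

    Headers that are not in the mapping are kept, with surrounding
    whitespace removed.
    """
    return [CANONICAL_NAMES.get(h.strip().lower(), h.strip()) for h in headers]

# Two illustrative data sets that name the same column differently.
survey_a = ["Name", "Year of Birth"]
survey_b = ["Name", "birthday"]

print(harmonize_headers(survey_a))
print(harmonize_headers(survey_b))
```

After harmonizing, both header lists use “birthyear”, so the two data sets can be joined on matching column names.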

Avoid spaces in column headers, because some software requires headers without them. For example, “age (months)” might become “age_months”; “birthday” could be “birthdate”.
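A header like this can be cleaned automatically. The sketch below is one possible approach in Python, not a required convention: the function name is hypothetical, and the choice to lowercase headers and to keep only unaccented letters, digits, and underscores is an assumption.

```python
import re

def clean_header(header):
    """Turn a free-form header into a lowercase name without spaces."""
    name = header.strip().lower()
    # Replace every run of characters other than unaccented letters,
    # digits, or underscores (spaces, parentheses, etc.) with one underscore.
    name = re.sub(r"[^0-9a-z_]+", "_", name)
    # Drop underscores left over at the start or end.
    return name.strip("_")

print(clean_header("age (months)"))
print(clean_header("Year of Birth"))
```

With these rules, “age (months)” becomes “age_months” and “Year of Birth” becomes “year_of_birth”.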

Review and clean up tabular data. Whether you collect them or generate them yourself, tables may not be formatted for accurate visual analysis. Check for errors, obvious outliers, typos and empty (null) rows. Ensure that columns are formatted as data types that correspond to how you will use their data. Spreadsheet software may assign inaccurate formats to columns (e.g. numbers, dates, text) when importing new datasets. For example, check that dates are defined as such. Change numbers to text in columns whose numbers are not actually quantities to be computed, such as numbers used as codes that index non-numerical values (e.g. zip codes as geographic references, medical codes). Check whether monetary values are saved as numbers, currency, or accounting, depending on how you will use them. Remove any pre-aggregated data that is not part of the raw data itself, such as totals or sub-totals that contain sums, averages, counts, etc. Remove introductory text such as titles or legends that appear apart from your column headers, and flatten any sub-headers by creating new columns for the major headers in the hierarchy. Conversely, be sure all columns have headers. Finally, fill in any blank cells and remove blank rows; check where white space appears in your headers and data; trim leading and trailing whitespace and collapse consecutive whitespace characters. OpenRefine is an excellent free tool for managing these issues and for cleaning and organizing datasets before you import them into visualization software.
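A few of these checks can be sketched in Python. This is only an illustration of whitespace trimming, blank-row removal, and keeping codes as text; it is not a substitute for a tool such as OpenRefine. The function names and the small sample table are hypothetical.

```python
import re

def clean_cell(value):
    """Collapse consecutive whitespace and trim both ends of a cell."""
    return re.sub(r"\s+", " ", value).strip()

def clean_rows(rows):
    """Clean every cell and drop rows that are entirely blank."""
    cleaned = []
    for row in rows:
        cells = [clean_cell(cell) for cell in row]
        if any(cells):  # keep the row only if at least one cell has content
            cleaned.append(cells)
    return cleaned

# Illustrative raw table with stray whitespace and one blank row.
raw = [
    ["zip_code", "town", "age_months"],
    [" 08544", "Princeton  ", "14"],
    ["", "  ", ""],
    ["07030", "  Hoboken", " 9 "],
]

table = clean_rows(raw)
header, records = table[0], table[1:]

# Convert only the real quantity to a number. Zip codes stay text so
# that their leading zeros survive.
for record in records:
    record[2] = int(record[2])

print(header)
print(records)
```

Treating the zip code as a number would turn “08544” into 8544, which is why code-like columns are left as text here.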

Keep a record of your data sources, noting when each data set was last collected, edited or published, and when you accessed or generated it.
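Such a record can be as simple as a small plain text file stored next to the data. The sketch below assumes Python and a JSON file; the field names, the function name, and the sample source are hypothetical, and any format that you use consistently would serve equally well.

```python
import json
from datetime import date

def provenance_record(source, last_updated, accessed=None):
    """Build a small record describing where a data set came from.

    If no access date is given, today's date is used.
    """
    return {
        "source": source,
        "last_updated": last_updated,
        "accessed": accessed or date.today().isoformat(),
    }

# Hypothetical example values.
record = provenance_record(
    source="Hypothetical county health survey, table 3",
    last_updated="2020-06-30",
    accessed="2021-02-15",
)

print(json.dumps(record, indent=2))
```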

You can extract tabulated data from PDFs and save as CSV tables using Tabula.


VizE Lab for Ethnographic Data Visualization
320 Aaron Burr Hall
