Nware Data Cleaning • minnccaccess

Glossary

Table: Types of raw datasets are called tables. There are 9 tables from the raw data: general, population, provider, provider_address, provider_attributes, provider_name, provider_schedule, shift and rate.
Panel: Types of clean datasets are called panels. There are 4 panels in the clean data: characteristics, shift, shift_agegroup, schedule.
Snapshot: The original data source is downloaded and shared with the team at a point in time. We call each round of data extraction a snapshot.

Step 1: Going over raw files

Checking the contents

We started by going over all the files of type .zip, .xls, .xlsx, .csv, .txt and .xml in the folder Google Drive/MinnCCAccess/Analysis/Data/Raw. We identified the files that are from the Nware database. Then we classified them into two groups: supersets and subsets. A file is called superset if it has unique observations or columns that no other file in the folder has. A file is called subset if all the information in the file can be found in another superset file. If two or more files have the exact same information, only one of them is classified as superset, and all the others are labelled as subsets.

Determining supersets and subsets

We went over every file by opening the files in Excel and checking their content. We wrote a log summarizing our findings on the contents of the file and their superset/subset status.

Then, we created a crosswalk table that contains information on the supersets of each file. Remember that a superset is a superset of itself. Then, we wrote a script that checks whether a subset is really a subset of its superset.

We also checked whether the crosswalk table contains all the Nware files in the raw folder.

Determining the table type

We checked every file to determine what type of a table that file contains. We used this information to rename the files.

Determining the date

We used the folder names and/or the file names to determine the date the data contained in the file is captured from the database. We used this information to rename the files.

Renaming the files

We renamed a copy of each superset and moved them to a new folder, Data Cabinet/Nware/Step 1 Data. The naming convention is as follows. The first part of the name contains information on the type of data that the file contains. The types are general, population, provider, provider_address, provider_attributes, provider_name, provider_schedule, shift and rate. The second part of the name is the ISO 8601-formatted date of the snapshot.

Then, we created a crosswalk table that contains the old and the new paths of the files. Then, we wrote a script that checks whether the old and the new files are actually the same. We manually checked the file names, by checking whether the file name actually represents the correct table type and the snapshot date.

Step 2: Merging across snapshots for each table

In this step, we do two things. First, we had multiple periods of time for each table, since there were many snapshots. We should have merged the tables so that we have the data for each table for all dates in a single dataframe. Second, some variables changed tables over time. For example, a variable in general table before 2017 started to appear in provider_name table starting 2017. In addition, there were multiple tables with the same unit of observation. We decided to combine all the tables with the unit of observation into a single panel. We also change variable names in this step to camel case.

Renaming the variables

Renaming the variables before merging the files allowed us to avoid having multiple columns for the same variable (with a different name across snapshots).

Merging across time

The final versions of the panels include all the snapshots from various dates.

An important thing to note here is that May 2012 snapshot of General table has problematic rows. We dropped these rows before appending this file.

Merging tables into panels

There are four different units of observations we found in the original tables.

Provider - Date
Provider - Shift - Date
Provider - Shift - Age Group - Date
Provider - Shift - Day of Week - Date

general, provider, provider_address, provider_attributes and provider_name are in the first group. shift is in the second group. population and rates are in the third group. Finally, provider_schedule is in the fourth group. This means that we can actually have 4 tables instead of 9, by merging the tables with the same units of observations by using the correct primary/foreign keys.

It is important to note that general also contains group 2, group 3 and group 4 -type data before 2017 in a very messy way. We will talk about how we take those variables and move them to their corresponding tables in the following sections.