nware.Rmd
Types of raw datasets are called tables. There are 9 tables from the raw data: general
, population
, provider
, provider_address
, provider_attributes
, provider_name
, provider_schedule
, shift
and rate
.
Types of clean datasets are called panels. There are 4 panels in the clean data: characteristics
, shift
, shift_agegroup
, schedule
.
The original data source is downloaded and shared with the team at a point in time. We call each round of data extraction a snapshot
.
We started by going over all the files of type .zip
, .xls
, .xlsx
, .csv
, .txt
and .xml
in the folder Google Drive/MinnCCAccess/Analysis/Data/Raw
. We identified the files that are from the Nware database. Then we classified them into two groups: supersets and subsets. A file is called superset if it has unique observations or columns that no other file in the folder has. A file is called subset if all the information in the file can be found in another superset file. If two or more files have the exact same information, only one of them is classified as superset, and all the others are labelled as subsets.
We went over every file by opening the files in Excel and checking their content. We wrote a log summarizing our findings on the contents of the file and their superset/subset status.
Then, we created a crosswalk table that contains information on the supersets of each file. Remember that a superset is a superset of itself. Then, we wrote a script that checks whether a subset is really a subset of its superset.
We also checked whether the crosswalk table contains all the Nware files in the raw folder.
We renamed a copy of each superset and moved them to a new folder, Data Cabinet/Nware/Step 1 Data
. The naming convention is as follows. The first part of the name contains information on the type of data that the file contains. The types are general
, population
, provider
, provider_address
, provider_attributes
, provider_name
, provider_schedule
, shift
and rate
. The second part of the name is the ISO 8601-formatted date of the snapshot.
Then, we created a crosswalk table that contains the old and the new paths of the files. Then, we wrote a script that checks whether the old and the new files are actually the same. We manually checked the file names, by checking whether the file name actually represents the correct table type and the snapshot date.
In this step, we do two things. First, we had multiple periods of time for each table, since there were many snapshots. We should have merged the tables so that we have the data for each table for all dates in a single dataframe. Second, some variables changed tables over time. For example, a variable in general
table before 2017 started to appear in provider_name
table starting 2017. In addition, there were multiple tables with the same unit of observation. We decided to combine all the tables with the unit of observation into a single panel. We also change variable names in this step to camel case.
Renaming the variables before merging the files allowed us to avoid having multiple columns for the same variable (with a different name across snapshots).
The final versions of the panels include all the snapshots from various dates.
An important thing to note here is that May 2012 snapshot of General table has problematic rows. We dropped these rows before appending this file.
There are four different units of observations we found in the original tables.
general
, provider
, provider_address
, provider_attributes
and provider_name
are in the first group. shift
is in the second group. population
and rates
are in the third group. Finally, provider_schedule
is in the fourth group. This means that we can actually have 4 tables instead of 9, by merging the tables with the same units of observations by using the correct primary/foreign keys.
It is important to note that general
also contains group 2, group 3 and group 4 -type data before 2017 in a very messy way. We will talk about how we take those variables and move them to their corresponding tables in the following sections.