Dirty Data Got You Down? Clean it Up
You can’t turn a corner these days without someone touting Big Data, data analytics and the power of data-driven decision making as a solution to some of our biggest challenges.
Big data is critical to cyber defense, threat detection, fraud prevention and future advancements in precision medicine. Yet all of these potential advancements hinge on the quality of the incoming data, including historic data compiled over years and even over decades. But turning existing troves of data into actionable insights is harder than it looks.
Most raw data is too “dirty” – inconsistent, inaccurate, duplicative, irrelevant or incomplete – to be actionable without significant work to clean it up. So-called dirty data costs U.S. businesses $600 billion a year, according to The Data Warehousing Institute research firm. And the problem is likely worse in government.
Take federal spending data on USASpending.gov, for example. The Government Accountability Office reported in 2014 that at least $619 billion representing 302 federal programs was missing. Two years later, GAO reported that up to 15 percent of data in the Pentagon’s Real Property Assets Database were inaccurate and incomplete. And in September, GAO found billions of dollars in faulty entries on ForeignAssistance.gov.
At the state level, seven out of 10 officials from 46 states reported that data problems were frequently or often an impediment to effectively doing business, according to a recent Governing survey. Even when bad data gets noticed, it doesn’t always get fixed. When a trade association complained to government agencies about the accuracy of data related to lead poisoning, agencies failed to make 59 of the 87 requested corrections.
For organizations looking to make sense of their data, the first step is to ensure the data are managed well: that they are consistent and accurate, and that rules are in force to keep users from doing anything to undermine data integrity. But whether you’re trying to ensure data quality going forward or clean up a whole history of sloppy data, the choices in the marketplace can be bewildering.
Cleaning up databases so they can be plumbed for insights starts with the basics, said Tyler Kleykamp, Connecticut’s state chief data officer. “The first issue is always inconsistency,” he said. “Maybe there are misspellings or format problems. Sometimes a field is entered in all caps and sometimes it is not. These seem like trivial issues to the person entering the data, but they become a problem when you try to use the data.”
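The format problems Kleykamp describes can be illustrated with a short Python sketch. The city names and both helper functions are invented for illustration and are not drawn from any Connecticut system. The sketch only handles capitalization and stray whitespace; misspellings would need fuzzier matching than this.

```python
import re

def normalize_field(value: str) -> str:
    """Trim, collapse internal whitespace and lowercase a value
    so that formatting variants compare as equal."""
    collapsed = re.sub(r"\s+", " ", value.strip())
    return collapsed.casefold()

def group_variants(values):
    """Group raw values under their normalized form."""
    groups = {}
    for raw in values:
        groups.setdefault(normalize_field(raw), []).append(raw)
    return groups

# Hypothetical entries as different operators might have typed them.
raw_cities = ["HARTFORD", "Hartford", " hartford ", "New  Haven", "NEW HAVEN"]
variants = group_variants(raw_cities)

for canonical, raws in variants.items():
    print(canonical, "<-", raws)
```

Grouping the raw values, rather than overwriting them, keeps the original entries available for review before anything is changed.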
Expediency can also cause problems. What if a regulatory change requires operators to begin identifying individuals’ gender when they hadn’t done so before? How that change is implemented will have long-term implications for how the data can be used. If the gender information is in a new field, it can be added for individual entries over time. But if agency managers take a shortcut, such as repurposing an unused field originally intended for some other purpose, the result could be troublesome.
At the Nuclear Regulatory Commission (NRC), a legacy system built over many years to fulfill multiple purposes contains data describing a few thousand licensees. Now, as part of its Reactor Program System modernization effort, the agency wants to extract, clean and build up the data into a manageable, standards-based and shareable database.
“As we set up the interfaces, we are doing the analysis, looking for duplicates, looking for empty fields,” said Cris Brown, NRC master data management program manager. It’s labor-intensive work. “You have to go back and ask the expert in the office: ‘What should this be, really?’ Then you can write a rule around that.”
To expedite the effort, Brown said, “We have set up data stewards – business people in various offices who can tell us what the information is supposed to look like.”
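The kind of analysis Brown describes (looking for duplicates, looking for empty fields, then writing a rule once an expert says what a value should be) might look something like the Python sketch below. The records, field names and the two-letter state rule are all hypothetical and are not taken from NRC systems.

```python
from collections import Counter

# Hypothetical licensee records; field names are illustrative only.
records = [
    {"license_id": "A-100", "name": "Acme Power", "state": "CT"},
    {"license_id": "A-100", "name": "Acme Power", "state": "CT"},
    {"license_id": "B-200", "name": "", "state": "NY"},
    {"license_id": "C-300", "name": "Cedar Labs", "state": "Conn."},
]

def find_empty_fields(rows):
    """Return (row index, field name) pairs where a value is missing or blank."""
    return [
        (i, key)
        for i, row in enumerate(rows)
        for key, value in row.items()
        if value is None or str(value).strip() == ""
    ]

def find_duplicate_keys(rows, key):
    """Return the key values that appear on more than one row."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)

def state_rule(row):
    """Example of a steward-supplied rule (assumed here):
    state must be a two-letter uppercase code."""
    state = row["state"]
    return len(state) == 2 and state.isalpha() and state.isupper()

empty = find_empty_fields(records)
duplicates = find_duplicate_keys(records, "license_id")
rule_failures = [i for i, row in enumerate(records) if not state_rule(row)]

print("empty fields:", empty)
print("duplicate ids:", duplicates)
print("rows failing state rule:", rule_failures)
```

The checks only flag problems. Deciding what the correct value should be remains a question for the data steward.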
The technology market research firm Gartner said organizations increasingly identify such roles within the business sides of their operations in recognition that data quality is less an information technology problem than a business process matter.
“Key roles such as data steward, data quality champion, data quality analyst and data owner are more often either on the business side or a hybrid of business and IT roles,” Gartner analysts Saul Judah and Ted Friedman wrote in a November 2015 report. “This indicates greater information maturity in the market and an increasing recognition that ensuring data quality requires organizational collaboration.”
More broadly, Judah and Friedman see this as an indication that over time, database management work will migrate from IT back offices to “self-service capabilities for data quality.” And some vendors are already developing products with that in mind.
Brown and others like her spend much of their time cleaning up data rather than helping analyze it. Sometimes files are corrupted; sometimes metadata is incomplete or locked in a proprietary format.
Outsiders face the same challenges trying to extract insights from government data.
Data advocacy group Open Knowledge International recently conducted an extensive review of government data related to procurements. “We knew no data set would be perfect, but it is worse than we expected,” said Community Manager and Open Data for Development Program Manager Katelyn Rogers. “You can go through government data sets and names will have different forms within a single file. Nothing matches with anything. We will get data sets where big portions are missing. Instead of covering an entire procurement, it only covers 25 percent of the information about that procurement.”
It’s not that government agencies don’t want clean data – they do. At some, like USAID, clean data is even a critical organizational goal. But data quality problems often don’t arise until someone introduces a need or question not asked before. Elizabeth Roen, senior policy analyst in USAID’s Office of Learning, Evaluation and Research, said USAID is looking to outsiders to help identify those holes. “One of the things we are hoping will happen is that when third parties start using this data, that they will alert us to where there are issues,” she said.
“Finally, users need to be concerned with deliberate efforts to disguise data,” said Jala Attia, senior program director of General Dynamics Information Technology’s Health Care Program Integrity Solutions Group. “In healthcare, for example, fraudulent claims use multiple variations on a person’s name. So Al Capone could be listed as Al Capone, Alphonse Capone, Alphonse G. Capone, Al G. Capone, Al Gabriel Capone, A. Gabriel Capone, or A.G. Capone. With advanced analytics tools, we can identify and reconcile some of these. But there’s still work to be done before the data is completely reliable.”
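As a rough illustration of how the name variations Attia lists could be pulled together for review, the Python sketch below builds a crude matching key from the surname and first initial. This is an assumed, simplified technique, not the analytics tooling Attia refers to, and it will over-match: an unrelated “Anna Capone” would land in the same group, so the output is a list of candidates for a human to check rather than a confirmed match.

```python
def match_key(full_name: str) -> tuple:
    """Build a crude key of (surname, first initial).
    Periods are treated as spaces so "A.G." splits into two initials."""
    tokens = full_name.replace(".", " ").casefold().split()
    if not tokens:
        return ("", "")
    return (tokens[-1], tokens[0][0])

def group_by_key(names):
    """Collect names that share the same matching key."""
    groups = {}
    for name in names:
        groups.setdefault(match_key(name), []).append(name)
    return groups

# The variations quoted above, plus one hypothetical unrelated name.
names = [
    "Al Capone", "Alphonse Capone", "Alphonse G. Capone", "Al G. Capone",
    "Al Gabriel Capone", "A. Gabriel Capone", "A.G. Capone", "Bugs Moran",
]
candidates = group_by_key(names)

for key, members in candidates.items():
    print(key, "<-", members)
```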
Careful How You Do That
Creating clean, structured databases begins with good processes, according to the Center for Open Data Enterprise, a non-profit based in Washington, D.C. In April, the organization co-hosted a roundtable on the quality of government databases with the White House Office of Science and Technology Policy. In a summary of that meeting, the group listed strategies for improving the quality of data in government databases:
- Address human factors to ensure data is formatted to meet end-user needs
- Strengthen data governance to ensure integrity in data collection, management and dissemination
- Establish effective feedback systems so that users can help to identify and eliminate data quality issues
- Institute improved data policies such as the Information Quality Act and ISO 8000, which set out quality requirements for open government data
For agencies, facing up to dirty data and figuring out where to start can be daunting. Connecticut’s Kleykamp takes a pragmatic approach: Start by putting the data to work and sharing it – both internally and externally. Then wait to see what holes emerge. From there, he said, data managers can prioritize the work that needs to be done.
“You have to start using [data] in bulk,” he said. “You never will figure out what the issues are until you try to answer a question with the data or do something with it that you aren’t currently doing. Nine times out of 10, that’s how you are going to find out where the issues lie.”