The age of big data has seen a host of new techniques for analysing large data sets. Before any of those techniques can be applied, however, the target data has to be aggregated, organised, and cleaned up.
An international team of computer scientists hopes to simplify this process with a new system called Data Civilizer, which automatically finds connections among many different data tables and allows users to perform database-style queries across all of them. The results of the queries can then be saved as new, orderly data sets that may draw information from dozens or even thousands of different tables.
“Modern organizations have many thousands of data sets spread across files, spreadsheets, databases, data lakes, and other software systems,” said Sam Madden, an MIT professor of electrical engineering and computer science and faculty director of MIT’s bigdata@CSAIL initiative.
“Civilizer helps analysts in these organizations quickly find data sets that contain information that is relevant to them and, more importantly, combine related data sets together to create new, unified data sets that consolidate data of interest for some analysis.”
Pairs and permutations
The Data Civilizer system begins by analyzing every column of every table at its disposal. First, it produces a statistical summary of the data in each column. For numerical data, that might include a distribution of the frequency with which different values occur; the range of values; and the “cardinality” of the values, or the number of different values the column contains. For textual data, a summary would include a list of the most frequently occurring words in the column and the number of different words. Data Civilizer also keeps a master index of every word occurring in every table and the tables that contain it.
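To make the profiling step concrete, here is a minimal Python sketch of what a per-column summary and a word index might look like. It is an illustration only, not Data Civilizer's code: the names `ColumnSummary`, `summarize_column` and `build_word_index`, the word-splitting rule, and the choice of three top values are all assumptions.

```python
import re
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnSummary:
    kind: str                      # "numeric" or "text"
    cardinality: int               # distinct values (numeric) or distinct words (text)
    top_values: list               # most frequent values or words, with their counts
    value_range: Optional[tuple]   # (min, max) for numeric columns, otherwise None


def summarize_column(values, top_n=3):
    """Build a small statistical summary of one column (hypothetical helper)."""
    is_numeric = bool(values) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    if is_numeric:
        counts = Counter(values)   # frequency with which each value occurs
        return ColumnSummary("numeric", len(counts), counts.most_common(top_n),
                             (min(values), max(values)))
    # Textual column: count lowercased words instead of whole cell values.
    words = [w for v in values for w in re.findall(r"\w+", str(v).lower())]
    counts = Counter(words)
    return ColumnSummary("text", len(counts), counts.most_common(top_n), None)


def build_word_index(tables):
    """Map every word to the set of tables that contain it.

    `tables` is assumed to be {table_name: {column_name: [cell values]}}.
    """
    index = {}
    for table_name, columns in tables.items():
        for values in columns.values():
            for value in values:
                for word in re.findall(r"\w+", str(value).lower()):
                    index.setdefault(word, set()).add(table_name)
    return index
```

Calling `summarize_column([1, 2, 2, 5])` on a numeric column would give a cardinality of 3 and a range of `(1, 5)`. The word index is what would allow a full-text search to jump straight from a word to the tables that mention it.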
Then the system compares all of the column summaries against each other, identifying pairs of columns that appear to have commonalities — similar data ranges, similar sets of words, and the like. It assigns every pair of columns a similarity score and, on that basis, produces a map, rather like a network diagram, that traces out the connections between individual columns and between the tables that contain them.
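The comparison step can be sketched in the same spirit. The article does not say how the similarity score is calculated, so the example below substitutes Jaccard overlap between the sets of values in two columns, and the 0.5 cut-off is likewise an assumption. The function names and the `(table, column)` keys are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Share of distinct values that two columns have in common (0.0 to 1.0)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def build_column_graph(columns, threshold=0.5):
    """Score every pair of columns and keep the pairs that look related.

    `columns` is assumed to be {(table_name, column_name): [cell values]}.
    The result maps each column to its neighbours and their scores.
    """
    graph = {name: {} for name in columns}
    for left, right in combinations(columns, 2):
        score = jaccard(columns[left], columns[right])
        if score >= threshold:
            # Store the link in both directions so the map can be walked either way.
            graph[left][right] = score
            graph[right][left] = score
    return graph
```

Scoring every pair means the work grows roughly with the square of the number of columns, which is one reason a system of this kind would compare compact summaries rather than raw data, as the article describes.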
Tracing a path
A user can then compose a query and, on the fly, Data Civilizer will traverse the map to find related data. Suppose, for instance, a pharmaceutical company has hundreds of tables that refer to a drug by its brand name, hundreds that refer to its chemical compound, and a handful that use an in-house ID number. Now suppose that the ID number and the brand name never show up in the same table, but there is at least one table linking the ID number and the chemical compound, and one linking the chemical compound and the brand name. With Data Civilizer, a query on the brand name will also pull up data from tables that use just the ID number.
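The pharmaceutical example amounts to finding a path through the map. The sketch below uses a plain breadth-first search over a hand-written set of links. The search strategy and the identifier names are assumptions made for illustration, not a description of how Data Civilizer itself traverses its map.

```python
from collections import deque


def find_path(links, start, goal):
    """Return the shortest chain of linked identifiers from start to goal, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in sorted(links.get(node, ())):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


# Hypothetical links: the brand name and the in-house ID never share a table,
# but each of them shares at least one table with the chemical compound.
links = {
    "brand_name": {"chemical_compound"},
    "chemical_compound": {"brand_name", "in_house_id"},
    "in_house_id": {"chemical_compound"},
}

print(find_path(links, "brand_name", "in_house_id"))
```

Here the search reaches the in-house ID by way of the chemical compound, which is the bridge that lets a query on the brand name pull in tables that use only the ID number.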
Some of the linkages identified by Data Civilizer may turn out to be spurious. However, the user can discard data that does not fit a query while keeping the rest. Once the data has been pruned, the user can save the results as their own data file.
“Data Civilizer is an interesting technology that potentially will help data scientists address an important problem that arises due to the increasing availability of data — identifying which data sets to include in an analysis,” said Iain Wallace, a senior informatics analyst at the drug company Merck.
“The larger an organization, the more acute this problem becomes.
“We are currently exploring how to use Civilizer as a harmonization layer on top of a variety of chemical-biology datasets.
“These datasets typically link compounds, diseases, and targets together. One use case is to identify which table contains information about a specific compound and what additional information is available about that compound in other related datasets. Civilizer helps us by allowing full text search over all the columns and then identifying related columns automatically. By using Civilizer, we should be easily able to add additional data sources and update our analysis very quickly.”