One of the most important activities in the new world of self-service analytics is making sure that data sets used for analysis and visualization are correct. Cleansing data was formerly a process delegated to Extract, Transform and Load (ETL) developers. Now, with tools like Oracle Data Visualization (DV), any end user can upload a file and perform the steps required to ensure a clean, consistent data set.
Oracle DV, whether as part of the Oracle Analytics Cloud (OAC) or in the Oracle Data Visualization Desktop (DVD) version, offers a variety of options for transforming data. The Data Flow functionality provides a step-based process for merging, refining, and enhancing data to generate new data sets. Before transforming a data set (especially a new one from an unknown source), it can be useful to simply profile its contents: to identify bad data values, to find opportunities for binning values together, and to build confidence in the validity of the data set.
Oracle DV has a profiling capability built into its Data Set processing function, but it is effectively a hidden feature unless the user knows where to look.
We’ll start with a data set that has been uploaded into OAC DV that contains donation data for teachers and schools.
If this were a new data set that a business analyst had never seen before, it would be helpful to get an understanding of the various column values: the cardinality (uniqueness) of each column, the number of null values, and any data anomalies that might affect the visualizations built on top of the data.
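Outside of DV, the same basic profile can be sketched in a few lines of Python. The rows and column names below (such as SCHOOL_STATE) are hypothetical stand-ins for the donation file, not its actual schema:

```python
from collections import Counter

# A few hypothetical rows standing in for the donation data set;
# the column names are illustrative, not the actual file's schema.
rows = [
    {"SCHOOL_STATE": "NY", "TOTAL_DONATIONS": 250.0},
    {"SCHOOL_STATE": "CA", "TOTAL_DONATIONS": 100.0},
    {"SCHOOL_STATE": "NY", "TOTAL_DONATIONS": None},
    {"SCHOOL_STATE": None, "TOTAL_DONATIONS": 75.0},
]

def profile_column(rows, col):
    """Return the basics an analyst would want to see: cardinality,
    null count, and a frequency count of the non-null values."""
    values = [r[col] for r in rows]
    non_null = [v for v in values if v is not None]
    return {
        "cardinality": len(set(non_null)),
        "nulls": len(values) - len(non_null),
        "counts": Counter(non_null),
    }

print(profile_column(rows, "SCHOOL_STATE"))
# {'cardinality': 2, 'nulls': 1, 'counts': Counter({'NY': 2, 'CA': 1})}
```

This is exactly the kind of per-column summary that the "Data Tiles" feature described below produces visually, without any coding.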
Start by clicking on the “Create Project” button in the upper right corner.
This brings up an empty Data Visualization canvas.
To profile the data, first click on the “Prepare” button at the top of the screen.
This will cause the “Data Preparation” screen to be displayed for the data set. On this screen, there are many options to transform the columns either by applying functions, binning, or creating new columns.
For this example, we will look at the “hidden” profiling option. It’s called “Data Tiles” and it can be found at the top of the screen on the dropdown with a grid icon:
Select “Data Tiles” and wait for the magic to happen! For every attribute column in the data set, a chart or graph will be generated that by default does a count of rows for the top ten values in that column. Note that the time it takes to generate the charts will depend on the number of attribute columns and the number of rows in the entire data set.
Here’s what the initial “Data Tiles” screen looks like for the donation data set:
As you can see, each attribute column in the data set now has its own chart that shows a record count by the unique values in the column. For columns that have more than ten unique values, only the first ten are displayed and the rest are grouped into a category called “OTHER”.
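The top-ten-plus-"OTHER" grouping can be expressed in a short sketch. The state counts here are made up for illustration, and the cutoff is lowered to three so the rollup is visible:

```python
from collections import Counter

def top_n_with_other(counts, n=10):
    """Keep the n most frequent values; roll everything else
    into a single "OTHER" bucket, as the Data Tiles charts do."""
    ranked = counts.most_common()
    top, rest = ranked[:n], ranked[n:]
    result = dict(top)
    if rest:
        result["OTHER"] = sum(c for _, c in rest)
    return result

counts = Counter({"NY": 40, "CA": 30, "TX": 20, "FL": 5, "WA": 3})
print(top_n_with_other(counts, n=3))
# {'NY': 40, 'CA': 30, 'TX': 20, 'OTHER': 8}
```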
Also, the type of chart or graph displayed will depend on the data type of the attribute column. For text columns, a horizontal bar chart is displayed. For date columns, a line chart with the data values as the X axis is displayed. For numeric attribute columns, a vertical bar chart is generated.
At this point, the business analyst can examine each of the data set column charts to determine if the values are acceptable or require additional transformation.
There are additional options on the “Data Tiles” screen that can be used to further the data profiling. Rather than show the row counts by each column value, there is a dropdown available to change the charts to use any of the columns that have been identified as metrics in the data set. So, for the donation data set, the dropdown includes choices for “STUDENTS_REACHED,” “TOTAL_DONATIONS,” and “NUM_DONORS.”
Choosing “TOTAL_DONATIONS” causes a different set of charts to be displayed for each attribute column, with the results sorted by the total dollar amount for each value in that column.
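Conceptually, switching the dropdown from a row count to a metric swaps counting for a group-and-sum. A minimal sketch (again with hypothetical rows and column names):

```python
from collections import defaultdict

# Illustrative rows; the column names are assumed, not the real schema.
rows = [
    {"SCHOOL_STATE": "NY", "TOTAL_DONATIONS": 250.0},
    {"SCHOOL_STATE": "CA", "TOTAL_DONATIONS": 100.0},
    {"SCHOOL_STATE": "NY", "TOTAL_DONATIONS": 50.0},
]

def sum_metric_by(rows, attr, metric):
    """Group by an attribute and sum a metric, sorted descending,
    which is the view the TOTAL_DONATIONS dropdown choice produces."""
    totals = defaultdict(float)
    for r in rows:
        totals[r[attr]] += r[metric] or 0.0
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

print(sum_metric_by(rows, "SCHOOL_STATE", "TOTAL_DONATIONS"))
# [('NY', 300.0), ('CA', 100.0)]
```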
Two other options exist to modify the “Data Tiles” charts. Clicking the icon next to the “View By” dropdown reveals checkboxes that control whether the charts show null values and whether they show the “OTHER” grouping for values outside the top ten.
Using the “Data Tiles,” the analyst can review each column and determine if there are any bad data values or if some values can be binned together to improve the analysis. These may also identify outliers in the data set that could negatively impact the results. Those outliers could be filtered out either in the DV project or through the use of the DV “Data Flows” function to generate a new data set.
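As one example of the kind of outlier rule an analyst might then apply, here is a sketch using the common 1.5×IQR fence. The metric values and the fence multiplier are assumptions for illustration; inside DV, a filter or a Data Flow step would apply the equivalent rule:

```python
import statistics

# Hypothetical donation amounts with one obvious outlier.
donations = [100.0, 120.0, 95.0, 110.0, 105.0, 5000.0]

# Keep values within 1.5 * IQR of the quartiles; the 1.5 multiplier
# is a conventional choice, not anything DV prescribes.
q1, _, q3 = statistics.quantiles(donations, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
kept = [v for v in donations if low <= v <= high]
print(kept)  # the 5000.0 value falls outside the upper fence
```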
In conclusion, as business analysts become more engaged in acquiring data for self-service analysis, the “Data Tiles” function can be very helpful in making sure the data is accurate and useful.