The Subjective Nature of Data: Random Thoughts
As I wrote the last two blog posts on The Subjective Nature of Data, there were several other points about data that came to my mind.
Here they are:
Be suspicious of numbers; be suspicious of percentages
Percentages and numbers can give data clarity, or they can be very misleading.
Percentages on small numbers are... ridiculous. They give disproportionate importance to data categories and skew perspective. In the early days of ecommerce, reports would come out each year claiming 100–200% market growth. Yes, last year's 25 merchants had grown to 75 merchants – 200% market growth.
With small numbers, raw counts are far more accurate and meaningful than percentages.
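The merchant-growth arithmetic above can be sketched in a few lines of Python (the function name is mine, invented for illustration):

```python
def percent_growth(old: float, new: float) -> float:
    """Percentage growth from an old value to a new value."""
    return (new - old) / old * 100

# 25 merchants growing to 75 sounds explosive as a percentage...
print(percent_growth(25, 75))  # → 200.0

# ...but the absolute change is only 50 merchants.
print(75 - 25)  # → 50
```

The same 200% headline would describe 1 merchant becoming 3; the raw count tells the honest story.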
However, when numbers get too cumbersome for easy visualization, percentages are useful to show relative size of the different components of the data. Pie charts can give a quick visualization of market share split between competitors. Bar graphs can show the ebb and flow of transaction volume. Percentages can be very useful.
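As a small sketch of that point: once raw volumes get cumbersome, converting them to percentage shares makes relative size obvious at a glance (the merchant names and volumes here are invented for illustration):

```python
# Hypothetical transaction volumes by merchant (invented figures).
volumes = {"Acme": 1_250_000, "Bolt": 750_000, "Cargo": 500_000}

total = sum(volumes.values())

# Convert raw volumes into percentage shares of the total.
shares = {name: round(v / total * 100, 1) for name, v in volumes.items()}

print(shares)  # → {'Acme': 50.0, 'Bolt': 30.0, 'Cargo': 20.0}
```

The shares are what a pie chart would draw; the raw volumes are what a bar graph of transaction counts would show.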
Numbers can be equally misleading. Annual reports are renowned for their use and abuse of numbers. I remember a professor telling us, “Always read the footnotes in annual reports – that is where the real data hides.” Big positive numbers, like sales volume, sound so impressive in the annual report – especially when they are NOT put next to the financial loss per sale documented in footnotes in the appendix of the same annual report.
For numbers to be honest, the good, the bad, and the ugly need equal prominence.
In data, correlation is fairly easy to find; causation is impossible to find
If data is analyzed correctly, correlations between its components emerge. Repeating patterns show up as the data ebbs and flows rhythmically.
Some data correlations are just chance – they don’t have significance. Other correlations do suggest that something outside of the data may be a cause of the correlation.
There is a danger in assigning causation to a data correlation.
Several different influences can bring about the same data correlation – and a single cause can have very different effects on the data (i.e., no repeating correlation). Nothing in the correlation itself points to a specific cause. Therefore, causation is a guess or a probability – nothing more. Even if I remove the probable cause and the data responds differently, I cannot prove this same cause was the reason for the previous correlations. I have only proved it was an influence on the data now.
So, identify the data correlations, but be humble and not emphatic with possible or probable causes for those correlations.
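A minimal sketch of chance correlation: two random walks generated completely independently will often show a sizeable Pearson correlation, even though neither influences the other (a plain-Python implementation, making no claim about any particular dataset):

```python
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def random_walk(n):
    """A series of n cumulative coin-flip steps — pure chance, no cause."""
    walk, total = [], 0
    for _ in range(n):
        total += random.choice([-1, 1])
        walk.append(total)
    return walk

random.seed(7)  # fixed seed so the sketch is repeatable
a, b = random_walk(200), random_walk(200)
print(f"correlation between two unrelated walks: {pearson(a, b):.2f}")
```

Run it with different seeds and the coefficient swings widely – the "pattern" is pure chance, which is exactly why a correlation alone cannot name its cause.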
Source data: know its roots, its purpose, its audience
Data has roots. It comes from somewhere. It doesn’t just appear. For most of us, the data we have access to has been collected by others and possibly for other purposes (more about that in a moment). I can only know the relative value of the data I handle if I know where that data came from:
· What mechanisms collected the data?
· What purpose was in mind when the data was filtered?
· Is this source data, or has it been compiled from one or more other sources?
Knowing the roots of the data lets me know the utility of that data. If I don’t know the roots of the data, I cannot know if it can be useful for my purposes.
Collected data has purpose. There is always a purpose framework around data collection and compilation. Nobody collects data just to collect data. Was this data:
· Collected for a marketing team to show product performance?
· Compiled to anonymize it for use in a fee-based research engine?
· Aggregated for a specific set of reports?
Unless I know the purpose of the data I am accessing, I may attempt to extrapolate conclusions the data was never meant to support.
Data is collected for a specific audience. Data collection, filtering, and presentation are always shaped for a target audience.
I remember looking for data on the performance of contactless cards at Automated Fuel Dispensers. As I pored through the data warehouse aggregated data, I found that my data query results did not match what we were seeing day-to-day in our support roles.
I finally asked the database support team what target audience the aggregation had been created for. It had been created for executives, and it focussed on the sales side of the business. I was definitely not that audience – therefore, the aggregated data was useless to me.
Once I found the data warehouse universe for the technical development teams, I was able to run queries that confirmed and dimensioned the challenges we were seeing day-to-day. Their data requirements and mine aligned, so we could share the same databases.
Data aggregation is a “Frenemy”
I have already mentioned “data aggregation” a couple of times in this blog post. It really can be a best friend or an awful enemy!
In my example above (the gas station query), data aggregation had grouped the data I needed with other “similar” (for the target audience) data. For them, yes, it made sense. For me, it skewed the data so that each of my queries returned misleading results.
There have been other times where aggregation proved incredibly useful.
I used to pull monthly reports on cardholder spend across merchants in Canada. The data, as pulled, showed one merchant in a certain merchant category as having the lion’s share of the cardholder spend in that merchant category – much larger than any other merchant in the report.
In doing some other research, I found that one of their competitors had almost two dozen “doing business as” names – unique business names all owned by this one merchant. I aggregated all those “doing business as” names under one umbrella name for that merchant and – suddenly – the “biggest merchant” was now the second largest merchant on my report! Correct aggregation had changed our view of the Canadian merchant market.
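A minimal sketch of that umbrella-name aggregation, with invented merchant names and spend figures standing in for the real report:

```python
from collections import defaultdict

# Hypothetical monthly report rows: (merchant name as billed, spend).
# Names and figures are illustrative, not from the original data.
report = [
    ("Mega Mart", 400_000),          # looks like the biggest merchant...
    ("QuickStop #1", 150_000),
    ("QuickStop Express", 140_000),
    ("QS Fuel Bar", 130_000),
    # ...imagine almost two dozen "doing business as" names here.
]

# Map each "doing business as" name to its umbrella merchant.
dba_to_parent = {
    "QuickStop #1": "QuickStop Group",
    "QuickStop Express": "QuickStop Group",
    "QS Fuel Bar": "QuickStop Group",
}

# Roll spend up under the umbrella name; unmapped names stand alone.
totals = defaultdict(int)
for name, spend in report:
    totals[dba_to_parent.get(name, name)] += spend

for merchant, spend in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(merchant, spend)
# → QuickStop Group 420000
# → Mega Mart 400000
```

Before the roll-up, each “doing business as” name looks like a mid-sized merchant; after it, the umbrella merchant tops the report – the same reshaping of the view the aggregation produced in the Canadian data.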