Key questions to start your data analysis projects


What business problem do you think you are trying to solve?

This may seem obvious, but many people fail to ask it before jumping in. Notice how I qualified the question with “do you think?” Sometimes the root cause of a problem is not what we initially believe it to be.

In any case, you do not have to solve the whole problem at once by trying to boil the ocean. In fact, you should not take this approach. Project methodologies (such as agile) allow organizations to take an iterative approach and embrace the power of small batches.

What types and sources of data are available to you?

To be sure, most, if not all, organizations store huge amounts of data. Looking at internal databases and data sources makes sense. However, do not make the mistake of thinking the discussion ends there.

External data sources in the form of open data sets (such as data.gov) continue to proliferate. There are easy ways to retrieve data from the web and get it back in a usable format – scraping, for example. These tactics may work well in academic environments, but scraping can be a sign of data immaturity for businesses. It is always best to get your hands on the original data source whenever possible.
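To make “a usable format” concrete, here is a minimal Python sketch that parses CSV text – the format many open data portals, including data.gov, offer for download – into records you can work with. The `rows_from_csv` helper and the sample rows are illustrative, not drawn from any real dataset:

```python
import csv
import io

def rows_from_csv(text):
    """Parse CSV text (e.g. downloaded from an open data portal) into dicts."""
    return list(csv.DictReader(io.StringIO(text)))

# In practice you would fetch the text first, for example with
# urllib.request.urlopen(url).read().decode("utf-8").
sample = "city,population\nSpringfield,167000\nShelbyville,65000\n"
rows = rows_from_csv(sample)
print(rows[0]["city"])  # Springfield
print(len(rows))        # 2
```

Note that this assumes a clean download straight from the source – exactly the “original data source” the paragraph above recommends – rather than scraping values out of rendered web pages.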

Warning: Just because your organization stores data doesn’t mean you can easily access it. Pernicious internal politics stifle many an analytics endeavor.

What types and sources of data are you allowed to use?

With all the talk of privacy and security these days, foolish is the soul who fails to ask this question. As some retail executives have learned in recent years, a business can fully comply with the law and still make customers feel decidedly icky about how their purchase data is used. Or consider a health care organization – it may not technically violate the Health Insurance Portability and Accountability Act of 1996 (HIPAA), but it can still raise privacy concerns. Another example is the GDPR. Adhering to this regulation means that organizations will not necessarily be able to use personal data as they could before – at least not in the same way.

What is the quality of your organization’s data?

Common errors here include assuming that your data is complete, accurate, and unique (read: free of duplicates). During my consulting career, I could count on one hand the number of times a client provided me with a “perfect” dataset. While it’s important to clean your data, you don’t need pristine data just to get started. As Voltaire said, “Perfect is the enemy of good.”
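A couple of those assumptions – completeness and uniqueness – can be checked in a few lines of code before any deeper cleaning. The `profile` helper and the sample `orders` records below are made up for illustration:

```python
def profile(records, key):
    """Basic quality checks: count rows with missing values and duplicate keys."""
    missing = [r for r in records if any(v in (None, "") for v in r.values())]
    seen, dupes = set(), []
    for r in records:
        if r[key] in seen:
            dupes.append(r)
        seen.add(r[key])
    return {"rows": len(records), "missing": len(missing), "duplicates": len(dupes)}

orders = [
    {"id": 1, "amount": 25.0},
    {"id": 2, "amount": None},   # incomplete row
    {"id": 1, "amount": 25.0},   # duplicate id
]
print(profile(orders, "id"))  # {'rows': 3, 'missing': 1, 'duplicates': 1}
```

Even a crude profile like this tells you whether the dataset is good enough to start with – which, per Voltaire, is usually the right bar.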

What tools are available to extract, clean, analyze, and present the data?

This is 2018, not 1998. Please do not tell me that your analysis efforts are limited to spreadsheets.

Sure, Microsoft Excel works with structured data – if the data set is not too large. But make no mistake: everyone’s favorite spreadsheet program suffers from serious restrictions in areas such as:

  • Handling semi-structured and unstructured data.
  • Tracking changes / version control.
  • Size limitations.
  • Governance.
  • Security.
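To make the first of these limitations concrete, here is a small Python sketch showing how semi-structured JSON – the kind of nested data a spreadsheet handles poorly – can be reshaped into flat, tabular records. The `flatten` helper and the sample record are hypothetical:

```python
import json

# A nested JSON record: semi-structured data that doesn't fit a flat grid.
raw = '{"user": {"name": "Ada", "tags": ["analyst", "admin"]}, "active": true}'

def flatten(obj, prefix=""):
    """Flatten nested dicts into dotted keys; join lists for tabular use."""
    flat = {}
    for k, v in obj.items():
        key = f"{prefix}{k}"
        if isinstance(v, dict):
            flat.update(flatten(v, key + "."))
        elif isinstance(v, list):
            flat[key] = ";".join(map(str, v))
        else:
            flat[key] = v
    return flat

record = flatten(json.loads(raw))
print(record)  # {'user.name': 'Ada', 'user.tags': 'analyst;admin', 'active': True}
```

In a spreadsheet, you would be doing this reshaping by hand; in a scripting language, it is a reusable dozen lines.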

For now, suffice it to say that if you are trying to analyze large, complex data sets, there are many tools worth exploring. The same goes for visualization. Never before have we seen such a range of powerful, affordable, and user-friendly tools designed to present data in interesting ways. SAS® Visual Analytics, SAS® Visual Data Mining and Machine Learning, and a host of open source tools are just some of the applications and frameworks that make dataviz powerful and, dare I say, cool.

Warning 1: While software vendors often ape each other’s features, don’t assume that each application can do everything the others can.

Warning 2: With open source software, remember that “free” software is comparable to a “free” puppy. To be direct: even with open source software, you can expect to spend some time and effort on education and training.
