INTRODUCING DATA SCIENCE
"Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data."
Dhar, V. (2013)
Data Set is a collection of data.
Mode is the most common number in a data set.
Mean is the average of a data set; it can be very misleading when outliers are present.
Median is the middle number of the data set.
Range is the difference between the lowest and highest values.
Standard Deviation is a measure of how spread out numbers are.
Confidence Interval is the range of values we are fairly sure our true value is in.
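The descriptive statistics above can be sketched with Python's standard `statistics` module. The sample values are hypothetical and chosen only to show how a single outlier affects the mean; they do not come from any real data set.

```python
import statistics

# Hypothetical sample with one unusually large value (an outlier).
data = [4, 5, 5, 6, 7, 8, 50]

summary = {
    "mode": statistics.mode(data),        # most common value
    "mean": statistics.mean(data),        # arithmetic average
    "median": statistics.median(data),    # middle value of the sorted data
    "range": max(data) - min(data),       # highest minus lowest
    "stdev": statistics.stdev(data),      # sample standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

In this sample the mean (about 12.1) sits above six of the seven values because of the single outlier, while the median stays at 6.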
Data Set Integrity
Making decisions from poor quality data is like building castles on sand. We can use some of the descriptive statistics above to get a sense of the set, but beyond that we need to really see if we are working from a strong foundation.
If the data set already exists, check it against the following:
Validity: Does the data meet defined business boundaries?
Accuracy: Is it a logical value? Can it be verified from source?
Consistency: Were entries categorised correctly/consistently?
Traceability: Is there access to the source of the data, and can it be used for checking?
Timeliness: Is the data up to date?
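Some of these checks can be automated. The following is a minimal sketch, assuming each record is a dictionary with `value`, `category`, and `recorded_on` fields; the field names, limits, and records are all hypothetical. Accuracy and traceability need access to the source, so they are not covered here.

```python
from datetime import date, timedelta

# All limits below are assumptions made for illustration.
VALID_RANGE = (0, 120)                    # validity: assumed business boundary
ALLOWED_CATEGORIES = {"north", "south"}   # consistency: assumed agreed labels
MAX_AGE = timedelta(days=30)              # timeliness: assumed freshness limit


def check_record(record, today):
    """Return the names of the integrity checks that one record fails."""
    problems = []
    low, high = VALID_RANGE
    if not low <= record["value"] <= high:
        problems.append("validity")
    if record["category"] not in ALLOWED_CATEGORIES:
        problems.append("consistency")
    if today - record["recorded_on"] > MAX_AGE:
        problems.append("timeliness")
    return problems


# Hypothetical records and a fixed reference date so the result is repeatable.
today = date(2024, 6, 30)
records = [
    {"value": 42, "category": "north", "recorded_on": date(2024, 6, 20)},
    {"value": 150, "category": "North", "recorded_on": date(2024, 1, 5)},
]
report = [check_record(r, today) for r in records]
print(report)
```

The second record fails all three checks: its value is out of bounds, its category is capitalised differently from the agreed labels, and it is older than the assumed limit.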
It is crucial that data capture processes are standardised to minimise the number of errors upstream. Large efforts here pay dividends in all the work that follows.
A Flow from Data to Decisions:
1. The Question frames the why of the work to follow.
2. Data Acquisition serves the why and should ideally yield continuous data.
3. Data Preparation is formatting the data and verifying its integrity before analysis.
4. Data Analysis uses different tools to generate insights.
5. The Decision can now be made with new perspectives.
Types of Data:
Discrete data is countable, while continuous data is measurable. Continuous data contains more information because the points within a measurement can themselves be counted. Continuous data can therefore be viewed as 'raw' data.
Randomised Controlled Experiments ↓
Regression Analysis ↓
Statistical Significance ↓
Hypothesis Testing ↓
1. Confidence Intervals (replacing averages)
Making decisions based on averages can be misleading. Outliers or incorrect sampling may skew the results. A Confidence Interval is a range of values we are fairly sure our true value lies in. So, instead of saying the mean is (x), calculate the confidence interval and present the result as: "the mean is between (x) & (y) with a confidence of (z) %"
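One way to compute such an interval is sketched below. It assumes a normal approximation (a z-score taken from `statistics.NormalDist`); for small samples a t-distribution is usually more appropriate. The function name `mean_confidence_interval` and the sample values are hypothetical.

```python
import math
import statistics


def mean_confidence_interval(sample, confidence=0.95):
    """Approximate confidence interval for the mean (normal approximation)."""
    n = len(sample)
    mean = statistics.mean(sample)
    # Standard error of the mean, using the sample standard deviation.
    std_err = statistics.stdev(sample) / math.sqrt(n)
    # z-score for a two-sided interval at the requested confidence level.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * std_err, mean + z * std_err


# Hypothetical measurements.
sample = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7, 12.5, 12.1]
low, high = mean_confidence_interval(sample, confidence=0.95)
print(f"the mean is between {low:.2f} & {high:.2f} with a confidence of 95 %")
```

Raising `confidence` widens the interval: being more sure means accepting a broader range.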
2. Box Plots (replacing bar charts)
Box plots are for displaying quantitative data because they deal with medians, quartiles, and other measures of actual quantities. Bar graphs are for displaying categorical (qualitative) data because they deal with counts.
As you can see from the above, if only the coloured lines were displayed, we would not see that the data in green was fairly consistent while the data in blue was much more spread out. Box plots are a useful way to get a peek at the raw data as we convert it into visualisations.
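The numbers behind a box plot can be sketched as a five-number summary. The two groups below are hypothetical and are named `green` and `blue` only to echo the description above; they share a median but differ in spread. Quartiles use the default method of `statistics.quantiles`; other tools may use slightly different quartile definitions.

```python
import statistics


def five_number_summary(data):
    """Return the values a box plot draws: min, Q1, median, Q3, max."""
    q1, median, q3 = statistics.quantiles(data, n=4)
    return {
        "min": min(data),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(data),
    }


# Hypothetical groups with the same median but different spread.
green = [48, 49, 50, 50, 50, 51, 52]
blue = [20, 35, 45, 50, 55, 65, 80]

for name, group in (("green", green), ("blue", blue)):
    s = five_number_summary(group)
    iqr = s["q3"] - s["q1"]  # interquartile range: the height of the box
    print(f"{name}: {s}  IQR={iqr}")
```

A chart showing only the medians would make the two groups look identical; the quartiles and extremes reveal the difference in spread.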