A Beginner’s Guide to Data Science
We hear a lot about data science nowadays. Businesses widely adopt it to gain insights and to produce forecasts and predictive and prescriptive analytics. But do we understand what data science is? In this article, we attempt to dissect the IT industry's new-found love.
What is Data Science?
Data science, as described by Frank Lo, “is all about uncovering findings from data. Diving in at a granular level to mine and understand complex behaviors, trends, and inferences. It's about surfacing hidden insights that can enable companies to make smarter business decisions.” Data science helps businesses discover patterns in data using a variety of statistical algorithms. It relies on tools such as SAS, R, Python, and SPSS to assist in such discoveries. In many respects, data science is closely associated with Big Data.
Skills of a Data Scientist
The practitioners of data science are called data scientists, and the job is often described as one of the trending jobs of the 21st century. An expert data scientist must exhibit three primary skills, as shown in the image below:
Computer Science: Data scientists use a variety of programming languages and software packages to flexibly and efficiently extract, clean, analyze, and visualize data. Though there are always new tools in the rapidly changing world of data science, a few have stood the test of time such as R, Python, and SAS.
Domain Knowledge: Data scientists are needed in nearly every industry. As the availability of data grows, so do the applications. Data science is no longer limited to businesses in the technology and financial domains. Each industry has its unique goals, datasets, and constraints. An expert data scientist can understand the unique requirements of the industry and apply their skills to extract valuable information from its data. Some metrics, such as profit and conversions, remain relevant across all industries, but many key performance indicators (KPIs) are highly specialized. This data makes up the business intelligence specific to an industry and can be used to understand where the business is and the historical trends that have taken it there.
Maths and Statistics: Software runs all the necessary statistical tests these days, but a data scientist still needs to possess the statistical sensibility to know which test to run and under what circumstances. A good understanding of multivariable calculus and linear algebra, which form the basis of many data analysis techniques, is likely to allow a data scientist to build in-house implementations of analysis routines as needed. An understanding of statistical theorems helps data scientists to understand the capabilities and limitations or assumptions of these techniques. A data scientist should understand the assumptions that need to be met for each statistical test.
Algorithms involved in Data Science
The next step after understanding data science and the skills involved in this trade is to know the categories of algorithms. There are two major classes of algorithms: supervised learning and unsupervised learning. What do these classes comprise? Let’s see.
Supervised learning is a technique in which the data scientist trains the machine using well-labeled data, that is, data already tagged with the correct answers. The supervised learning algorithm analyzes this training data (the set of training examples). The machine is then given a new set of examples and uses what it learned from the labeled data to predict the correct outcome for them.
Example: Suppose you are given a basket filled with different kinds of fruits. The first step is to train the machine on each of the different fruits, one by one:
If an object is red and has a rounded shape with a depression at the top, it is labeled Apple.
If an object is green-yellow and shaped like a long curving cylinder, it is labeled Banana.
Now suppose that, after training the machine, you give it a new fruit from the basket, say a banana, and ask it to identify the fruit. Since the machine has already learned from the previous examples, it now has to apply that knowledge. It first examines the fruit's color and shape, identifies the fruit as a banana, and puts it in the Banana category. Thus, the machine learns from training data (the basket of fruits) and then applies that knowledge to test data (the new fruit).
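The fruit example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the feature names (`color`, `shape`), the helper functions `count_matches` and `classify`, and the "most matching features" rule are invented here to mirror the description above, and real projects would normally use a machine learning library and far more training data.

```python
# Labeled training data: each example pairs observed features with the correct answer.
# Feature names and values are illustrative, taken from the fruit description above.
training_data = [
    ({"color": "red", "shape": "round with a depression at the top"}, "Apple"),
    ({"color": "green-yellow", "shape": "long curving cylinder"}, "Banana"),
]


def count_matches(a, b):
    """Count how many features have the same value in both examples."""
    return sum(1 for key in a if a[key] == b.get(key))


def classify(features, examples):
    """Return the label of the training example that shares the most features."""
    _, best_label = max(examples, key=lambda ex: count_matches(features, ex[0]))
    return best_label


# Test data: a new fruit the machine has to identify.
new_fruit = {"color": "green-yellow", "shape": "long curving cylinder"}
print(classify(new_fruit, training_data))  # Banana
```

The training step here is trivial (the examples are simply stored), but the two phases are the same as in the text: learn from labeled examples first, then label a new, unseen example.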
Classifications of supervised learning: Supervised learning can be divided into two categories of algorithms. These are:
Classification: A classification problem is when the output variable is a category, such as “Red” or “Blue”, “Disease” or “No Disease”, etc.
Regression: A regression problem is when the output variable is a real value, such as “dollars” or “weight”.
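The difference between the two categories shows up in the type of value a trained model returns. The sketch below uses small made-up datasets and two hypothetical helpers, `fit_threshold` (a very simple classifier that learns a cut-off between two classes) and `fit_line` (ordinary least squares for a straight line). Both are assumptions chosen for brevity, not the only way to solve either kind of problem.

```python
def fit_threshold(values, labels, positive):
    """Classification: learn a cut-off halfway between the two class means.
    Assumes the positive class tends to have the larger values."""
    pos = [v for v, lab in zip(values, labels) if lab == positive]
    neg = [v for v, lab in zip(values, labels) if lab != positive]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def fit_line(xs, ys):
    """Regression: ordinary least squares for y = intercept + slope * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope


# Made-up classification data: a measurement and its category.
readings = [36.5, 36.8, 38.5, 39.0]
categories = ["No Disease", "No Disease", "Disease", "Disease"]
cutoff = fit_threshold(readings, categories, positive="Disease")


def predict_category(reading):
    """The output is a category."""
    return "Disease" if reading >= cutoff else "No Disease"


# Made-up regression data: advertising spend vs. sales, both in thousands of dollars.
spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]
intercept, slope = fit_line(spend, sales)


def predict_sales(amount):
    """The output is a real value."""
    return intercept + slope * amount


print(predict_category(38.2))  # a label
print(predict_sales(5.0))      # a number
```

Both models are trained on labeled data, so both are supervised. Only the kind of answer differs: `predict_category` returns one of a fixed set of labels, while `predict_sales` returns a number on a continuous scale.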
Unsupervised learning is the training of a machine using information that is neither classified nor labeled, so the algorithm acts on that information without guidance. The task of the machine is to group unsorted information according to similarities, patterns, and differences without any prior training on the data. Unlike supervised learning, no prior teaching is provided in this process. The machine therefore has to find the hidden structure in unlabeled data by itself.
Example: Suppose the machine is given a set of pictures of dogs and cats that it has not seen before. The machine has no idea about the features of dogs and cats, so it has to categorize the pictures by their similarities, patterns, and differences. For instance, it can split them into two groups: the first may contain all the pictures with dogs, and the second may contain all the pictures with cats. Here the machine was not fed any prior examples or training data.
Classifications of unsupervised learning: Unsupervised learning can be divided into two categories of algorithms. These are:
Clustering: A clustering problem is where you want to discover the inherent groupings in the data, such as grouping customers by purchasing behavior.
Association: An association rule learning problem is where you want to discover rules that describe large portions of your data, such as people who buy X also tend to buy Y.
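Both categories can be illustrated without any labels in the input. The sketch below is a minimal example on made-up data: `two_means` is a basic k-means loop with two clusters that groups customers by monthly spend, and `confidence` measures how often baskets containing X also contain Y. The function names, the data, and the choice of k-means and rule confidence are assumptions made for illustration.

```python
def two_means(values, iterations=10):
    """Clustering: split numbers into two groups with a basic k-means loop (k = 2)."""
    centers = [min(values), max(values)]  # simple deterministic starting points
    clusters = [[], []]
    for _ in range(iterations):
        clusters = [[], []]
        for v in values:
            # Assign each value to the nearer of the two centers.
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return clusters


def confidence(baskets, x, y):
    """Association: of the baskets that contain x, the fraction that also contain y."""
    with_x = [b for b in baskets if x in b]
    if not with_x:
        return 0.0
    return sum(1 for b in with_x if y in b) / len(with_x)


# Made-up monthly spend per customer; no labels are supplied.
monthly_spend = [12, 15, 14, 90, 95, 88]
low_spenders, high_spenders = two_means(monthly_spend)
print(low_spenders, high_spenders)

# Made-up shopping baskets.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "jam"},
    {"milk"},
]
print(confidence(baskets, "bread", "butter"))  # 2 of the 3 bread baskets contain butter
```

In neither case is the machine told what the groups or rules should be. The customer groups come out of the spending figures themselves, and the strength of the rule "people who buy bread also tend to buy butter" comes out of the baskets.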
This article was intended to introduce data science, the skills it requires, and the division of its algorithms into supervised and unsupervised learning. In future articles, we will look at supervised and unsupervised learning in more detail.