Intro to Data Science

Full Disclaimer I am not a Data Scientist and these are just my getting started notes.

Introduction to Data Science

According to wikipedia Data Science can be summarized as follows:

Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data. Data science is the same concept as data mining and big data: “use the most powerful hardware, the most powerful programming systems, and the most efficient algorithms to solve problems”.

Important subjects to grasp about Data Science

It is important to have knowledge about statistics and be well versed in Mathematics to be an effective Data Science from what I have read.

Math Prerequisites for Data Science

Linear Algebra

According to Wolfram Alpha:

Linear algebra is the study of linear sets of equations and their transformation properties. Linear algebra allows the analysis of rotations in space, least squares fitting, solution of coupled differential equations, determination of a circle passing through three given points, as well as many other problems in mathematics, physics, and engineering. Confusingly, linear algebra is not actually an algebra in the technical sense of the word “algebra” (i.e., a vector space V over a field F, and so on).

Single Variate and Multi Variate Calculus

According to Wikipedia:

Calculus, originally called infinitesimal calculus or “the calculus of infinitesimals”, is the mathematical study of continuous change, in the same way that geometry is the study of shape and algebra is the study of generalizations of arithmetic operations.

It has two major branches, differential calculus and integral calculus. Differential calculus concerns instantaneous rates of change and the slopes of curves. Integral calculus concerns accumulation of quantities and the areas under and between curves. These two branches are related to each other by the fundamental theorem of calculus. Both branches make use of the fundamental notions of convergence of infinite sequences and infinite series to a well-defined limit.


Statistics is a branch of mathematics working with data collection, organization, analysis, interpretation and presentation. In applying statistics to a scientific, industrial, or social problem, it is conventional to begin with a statistical population or a statistical model to be studied. Populations can be diverse groups of people or objects such as “all people living in a country” or “every atom composing a crystal”. Statistics deals with every aspect of data, including the planning of data collection in terms of the design of surveys and experiments.

Data Science Concepts and Terminology

Artificial Intelligence

According to Wikipedia:

In computer science, artificial intelligence (AI), sometimes called machine intelligence, is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans. Colloquially, the term “artificial intelligence” is often used to describe machines (or computers) that mimic “cognitive” functions that humans associate with the human mind, such as “learning” and “problem solving”.

Machine Learning

According to Wikipedia:

Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use in order to perform a specific task effectively without using explicit instructions, relying on patterns and inference instead. It is seen as a subset of artificial intelligence. Machine learning algorithms build a mathematical model based on sample data, known as “training data”, in order to make predictions or decisions without being explicitly programmed to perform the task. Machine learning algorithms are used in a wide variety of applications, such as email filtering, and computer vision, where it is infeasible to develop an algorithm of specific instructions for performing the task. Machine learning is closely related to computational statistics, which focuses on making predictions using computers. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a field of study within machine learning, and focuses on exploratory data analysis through unsupervised learning.In its application across business problems, machine learning is also referred to as predictive analytics.

Supervised Learning

According to Wikipedia:

Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. An optimal scenario will allow for the algorithm to correctly determine the class labels for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a “reasonable” way (see inductive bias).

Unsupervised Learning

According to Wikipedia:

Unsupervised learning is a type of self-organized Hebbian learning that helps find previously unknown patterns in data set without pre-existing labels. It is also known as self-organization and allows modeling probability densities of given inputs. It is one of the main three categories of machine learning, along with supervised and reinforcement learning. Semi-supervised learning has also been described, and is a hybridization of supervised and unsupervised techniques.

Neural Network

According to Wikipedia:

A neural network is a network or circuit of neurons, or in a modern sense, an artificial neural network, composed of artificial neurons or nodes. Thus a neural network is either a biological neural network, made up of real biological neurons, or an artificial neural network, for solving artificial intelligence (AI) problems. The connections of the biological neuron are modeled as weights. A positive weight reflects an excitatory connection, while negative values mean inhibitory connections. All inputs are modified by a weight and summed. This activity is referred as a linear combination. Finally, an activation function controls the amplitude of the output. For example, an acceptable range of output is usually between 0 and 1, or it could be −1 and 1.

Natural Language Processing

According to Wikipedia:

Natural language processing (NLP) is a subfield of computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.

Deep Learning

According to Wikipedia:

Deep learning (also known as deep structured learning or hierarchical learning) is part of a broader family of machine learning methods based on artificial neural networks. Learning can be supervised, semi-supervised or unsupervised.

Deep learning architectures such as deep neural networks, deep belief networks, recurrent neural networks and convolutional neural networks have been applied to fields including computer vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation, bioinformatics, drug design, medical image analysis, material inspection and board game programs, where they have produced results comparable to and in some cases superior to human experts.

Data Science Tools

The following blog post in the Towards Data Science blog proved helpful to me

It seems that Python and the R Programming Languages are most popular for data scientists.

Python Programming Language


According to Wikipedia:

Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python’s design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.

R Programming Language

R Programming

R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.

Jupyter Notebooks

Jupyter Notebooks

The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.

Install Jupyter Notebook with Pip
pip install jupyterlab

git clone

## Run Jupyter Notebook
cd notebooks
Running Jupyter Notebook

Here is a screenshot where I run a specific notebook:

Notice here that in my machine a browser was opened and the Jupyter Notebook is running on port 8888 and here is the following url: http://localhost:8888/notebooks/notebooks/01.00-IPython-Beyond-Normal-Python.ipynb*

Python Pandas


pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

Other Resources for getting started with Data Science

Check out the Towards Data Science Blog and look at some udemy and udacity courses.

Data Science Cheat Sheet Github

Data Science Cheat Sheet

Contact Information

If you like this post then please follow me at jbelmont @ github and jbelmont80 @ twitter for more related type posts.

Feel free to leave a comment if you like.

Until Next Time :)

comments powered by Disqus