Data Science Course


Educational systems are under increasing pressure to reduce costs while maintaining or
improving outcomes for students. To improve educational productivity,1 many school
districts and states are turning to online learning.
In the United States, online learning alternatives are proliferating rapidly. Recent estimates
suggest that 1.5 million elementary and secondary students participated in some form of
online learning in 2010 (Wicks 2010). The term online learning can be used to refer to a
wide range of programs that use the Internet to provide instructional materials and facilitate
interactions between teachers and students and in some cases among students as well.
Online learning can be fully online, with all instruction taking place through the Internet, or
online elements can be combined with face-to-face interactions in what is known as blended
learning (Horn and Staker 2010).

1 As defined in this report, productivity is a ratio between costs and outcomes that can be improved in one of three ways: by
reducing costs while maintaining outcomes, improving outcomes while maintaining costs or transforming processes in a
way that both reduces costs and improves outcomes. Any improvements in productivity are likely to require initial
investments, but successful efforts reduce costs over the long term, even after these initial investments are taken into
account.

The purpose of this report is to support educational administrators and policymakers in
becoming informed consumers of information about online learning and its potential impact
on educational productivity. The report provides foundational knowledge needed to examine
and understand the potential contributions of online learning to educational productivity,
including a conceptual framework for understanding the necessary components of rigorous
productivity analyses, drawing in particular on cost-effectiveness analysis as an accessible
method in education. Five requirements for rigorous cost-effectiveness studies are described:
1) Important design components of an intervention are specified;
2) Both costs and outcomes are measured;
3) At least two conditions are compared;
4) Costs and outcomes are related using a single ratio for each model under study;
5) Other factors not related to the conditions being studied are controlled or held
constant.
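As a minimal sketch of requirements 2 through 4 above, the snippet below relates costs and outcomes with a single ratio for each condition and then compares the conditions. The condition names, per-pupil costs and outcome gains are invented purely for illustration and do not come from this report or from any study.

```python
# Hypothetical figures for illustration only; none come from the report.
conditions = {
    "place_based": {"cost_per_pupil": 9000.0, "outcome_gain": 10.0},
    "online": {"cost_per_pupil": 7200.0, "outcome_gain": 9.0},
}


def cost_effectiveness_ratio(cost, outcome):
    """Cost per unit of outcome; a lower ratio means higher productivity."""
    if outcome <= 0:
        raise ValueError("outcome must be positive to form a ratio")
    return cost / outcome


# One ratio per condition, so the conditions can be compared directly.
ratios = {
    name: cost_effectiveness_ratio(c["cost_per_pupil"], c["outcome_gain"])
    for name, c in conditions.items()
}

for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name}: {ratio:.0f} per unit of outcome")
```

A ratio like this is only meaningful when requirement 5 also holds; without controls, a difference between the two ratios could reflect differences in the students served rather than in the programs.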
The report also includes a review of ways that online learning might offer productivity
benefits compared with traditional place-based schooling. Unfortunately, the available
research on the impact of online learning on educational productivity for secondary school
students was found to be lacking. No analyses were found that
rigorously measured the productivity of an online learning system relative to place-based
instruction in secondary schools.2 This lack of evidence supports the call of the National
Educational Technology Plan (U.S. Department of Education 2010a) for a national initiative
to develop an ongoing research agenda dedicated to improving productivity in the education
sector. The evidence summarized in this report draws on literature that addressed either costs
or effectiveness. These studies typically were limited because they did not bring the two
together in a productivity ratio and compare results with other alternatives.
Given the limitations of the research regarding the costs and effects of online instruction for
secondary students, the review that follows also draws on examples and research about the
use of online learning for postsecondary instruction. While there are many differences
between higher education and elementary and secondary education (e.g., age and maturity of
students), postsecondary institutions have a broader and longer history with online learning
than elementary and secondary schools. The intention is to use the literature from higher
education to illustrate concepts that may apply to emerging practices in elementary and
secondary education. Findings from the studies of higher education should be applied with
caution to secondary education, as student populations, learning contexts and financial
models are quite different across these levels of schooling.

2 Two research reports, an audit for the Wisconsin State Legislature (Stuiber et al. 2010) and a study of the Florida Virtual
School (Florida Tax Watch Center for Educational Performance and Accountability 2007), include data about costs and
effects. These reports suggest that online learning environments may hold significant potential for increasing educational
productivity. Both found that online learning environments produced better outcomes than face-to-face schools and at a
lower per-pupil cost than the state average. However, these conclusions must be viewed cautiously because both reports
lacked statistical controls that could have ruled out other explanations of the findings.

While rigorously researched models are lacking, the review of the available literature
suggested nine applications of online learning that are seen as possible pathways to
improved productivity:
1) Broadening access in ways that dramatically reduce the cost of providing access to
quality educational resources and experiences, particularly for students in remote
locations or other situations where challenges such as low student enrollments make
the traditional school model impractical;
2) Engaging students in active learning with instructional materials and access to a
wealth of resources that can facilitate the adoption of research-based principles and
best practices from the learning sciences, an application that might improve student
outcomes without substantially increasing costs;
3) Individualizing and differentiating instruction based on student performance on
diagnostic assessments and preferred pace of learning, thereby improving the
efficiency with which students move through a learning progression;
4) Personalizing learning by building on student interests, which can result in
increased student motivation, time on task and ultimately better learning outcomes;
5) Making better use of teacher and student time by automating routine tasks and
enabling teacher time to focus on high-value activities;
6) Increasing the rate of student learning by increasing motivation and helping
students grasp concepts and demonstrate competency more efficiently;
7) Reducing school-based facilities costs by leveraging home and community spaces
in addition to traditional school buildings;
8) Reducing salary costs by transferring some educational activities to computers, by
increasing teacher-student ratios or by otherwise redesigning processes that allow for
more effective use of teacher time; and
9) Realizing opportunities for economies of scale through reuse of materials and their
large-scale distribution.
It is important to note that these pathways are not mutually exclusive, and interventions
intended to increase productivity usually involve multiple strategies to impact both the
benefit side (pathways 1–4) and cost side (pathways 5–9).
Determining whether online learning is more or less cost-effective than other alternatives
does not lend itself to a simple yes or no answer. Each of the nine pathways suggests a
plausible strategy for improving educational productivity, but there is insufficient evidence
to draw any conclusions about their viability in secondary schools. Educational stakeholders
at every level need information regarding effective instructional strategies and methods for
improving educational productivity. Studies designed to inform educational decisions should
follow rigorous methodologies that account for a full range of costs, describe key
implementation characteristics and use valid estimates of student learning.
Even less is known about the impact of online learning for students with disabilities.
Regarding potential benefits, the promise of individualized and personalized instruction
suggests an ability to tailor instruction to meet the needs of students with disabilities. For
example, rich multimedia can be found on the Internet that would seem to offer ready
inspiration for meeting the unique needs of the blind or the hearing impaired. In fact,
standards for universal design are available both for the Web and for printed documents. In
addition, tutorial models that rely on independent study are well suited to students with
medical or other disabilities that prevent them from attending brick-and-mortar schools.
However, while online learning offerings should be made accessible to students with
disabilities, doing so is not necessarily cheap or easy.
Any requirement to use a technology, including an online learning program, that is
inaccessible to individuals with disabilities is considered discrimination and is prohibited by
the Americans with Disabilities Act of 1990 and Section 504 of the Rehabilitation Act of
1973, unless those individuals are provided accommodations or modifications that permit
them to receive all the educational benefits provided by the technology in an equally
effective and equally integrated manner. The degree to which programs make such
accommodations is not yet known. To address this need, the U.S. Department of Education
recently funded the Center on Online Learning and Students With Disabilities, a five-year
research effort to identify new methods for using technology to improve learning. Similarly,
research regarding the degree to which current online learning environments meet the needs
of English language learners and how technology might provide a cost-effective alternative
to traditional strategies is just emerging.
The realization of productivity improvements in education will most likely require a
transformation of conventional processes to leverage new capabilities supported by
information and communications technologies. Basic assumptions about the need for seat
time and age-based cohorts may need to be reevaluated to sharpen focus on the needs and
interests of all students as individuals. And as rigorous evidence accumulates around
effective practices that may require institutional change, systemic incentives may be needed
to spur the adoption of efficient, effective paths to learning.

Implications for Online Learning

1. Strategies should be used to allow learners to perceive and attend to the information so that it can be transferred to working memory. Learners use their sensory systems to register the information in the form of sensations. Strategies to facilitate maximum sensation should be used. Examples include the proper location of the information on the screen, the attributes of the screen (color, graphics, size of text, etc.), the pacing of the information, and the mode of delivery (audio, visuals, animations, video). Learners must receive the information in the form of sensations before perception and processing can occur; however, they must not be overloaded with sensations, which could be counterproductive to the learning process. Non-essential sensations should be avoided to allow learners to attend to the important information. Strategies to promote perception and attention for online learning include those listed below.

• Important information should be placed in the center of the screen for reading, and learners must be able to read from left to right.
• Information critical for learning should be highlighted to focus learners’ attention. For example, in an online lesson, headings should be used to organize the details, and formatted to allow learners to attend to and process the information they contain.
• Learners should be told why they should take the lesson, so that they can attend to the information throughout the lesson.
• The difficulty level of the material must match the cognitive level of the learner, so that the learner can both attend to and relate to the material. Links to both simpler and more complicated materials can be used to accommodate learners at different knowledge levels.

2. Strategies should be used to allow learners to retrieve existing information from long-term memory to help make sense of the new information. Learners must construct a memory link between the new information and some related information already stored in long-term memory. Strategies to facilitate the use of existing schema are listed below.

• Use advance organizers to activate an existing cognitive structure or to provide the information to incorporate the details of the lesson (Ausubel, 1960). A comparative advance organizer can be used to recall prior knowledge to help in processing, and an expository advance organizer can be used to help incorporate the details of the lesson (Ally, 1980). Mayer (1979) conducted a meta-analysis of advance organizer studies, and found that these strategies are effective when students are learning from text that is presented in an unfamiliar form. Since most courses contain materials that are new to learners, advance organizers should be used to provide the framework for learning.
• Provide conceptual models that learners can use to retrieve existing mental models or to store the structure they will need to use to learn the details of the lesson.
• Use pre-instructional questions to set expectations and to activate the learners’ existing knowledge structure. Questions presented before the lesson facilitate the recall of existing


Google’s machine intelligence framework is the new hotness right now. And when TensorFlow became installable on the Raspberry Pi, working with it became very easy to do. In a short time I made a neural network that counts in binary. So I thought I’d pass on what I’ve learned so far. Hopefully this makes it easier for anyone else who wants to try it, or for anyone who just wants some insight into neural networks.

What Is TensorFlow?

To quote the TensorFlow website, TensorFlow is an “open source software library for numerical computation using data flow graphs”. What do we mean by “data flow graphs”? Well, that’s the really cool part. But before we can answer that, we’ll need to talk a bit about the structure for a simple neural network.
Binary counter neural network
Basics of a Neural Network

A simple neural network has some input units where the input goes. It also has hidden units, so-called because from a user’s perspective they’re literally hidden. And there are output units, from which we get the results. Off to the side are also bias units, which are there to help control the values emitted from the hidden and output units. Connecting all of these units are a bunch of weights, which are just numbers, each of which is associated with two units.

The way we instill intelligence into this neural network is to assign values to all those weights. That’s what training a neural network does: it finds suitable values for those weights. Once trained, in our example, we’ll set the input units to the binary digits 0, 0, and 0 respectively, TensorFlow will do stuff with everything in between, and the output units will magically contain the binary digits 0, 0, and 1 respectively. In case you missed that, it knew that the next number after binary 000 was 001. For 001, it should spit out 010, and so on up to 111, at which point it’ll spit out 000. Once those weights are set appropriately, it’ll know how to count.
Binary counter neural network with matrices

One step in “running” the neural network is to multiply the value of each weight by the value of its input unit, and then to add up those products for each hidden unit and store the sum in that hidden unit.

We can redraw the units and weights as arrays, or what are called lists in Python. From a math standpoint, they’re matrices. We’ve redrawn only a portion of them in the diagram. Multiplying the input matrix with the weight matrix involves simple matrix multiplication resulting in the five element hidden matrix/list/array.
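To make that step concrete, here’s a small pure-Python sketch of multiplying a 1×3 input by a 3×5 weight matrix to get the five hidden values. The weight values are arbitrary placeholders picked so the arithmetic is easy to follow, not trained values, and the helper name matmul is just ours for this illustration.

```python
def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m), both lists of lists."""
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions must match")
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# One input row holding the binary digits 0, 1, 1.
inputs = [[0, 1, 1]]

# Placeholder weights: 3 input units x 5 hidden units.
weights = [
    [1, 2, 3, 4, 5],
    [10, 20, 30, 40, 50],
    [100, 200, 300, 400, 500],
]

hidden = matmul(inputs, weights)
print(hidden)  # [[110, 220, 330, 440, 550]]
```

Each hidden value is the sum of the inputs times the weights in one column, which is why the first input (a zero) contributes nothing here.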
From Matrices to Tensors

In TensorFlow, those lists are called tensors. And the matrix multiplication step is called an operation, or op in programmer-speak, a term you’ll have to get used to if you plan on reading the TensorFlow documentation. Taking it further, the whole neural network is a collection of tensors and the ops that operate on them. Altogether they make up a graph.
Binary counter’s full graph
layer1 expanded

Shown here are snapshots taken of TensorBoard, a tool for visualizing the graph as well as examining tensor values during and after training. The tensors are the lines, and written on the lines are the tensor’s dimensions. Connecting the tensors are all the ops, though some of the things you see can be double-clicked on in order to expand for more detail, as we’ve done for layer1 in the second snapshot.

At the very bottom is x, the name we’ve given for a placeholder op that allows us to provide values for the input tensor. The line going up and to the left from it is the input tensor. Continue following that line up and you’ll find the MatMul op, which does the matrix multiplication with that input tensor and the tensor which is the other line leading into the MatMul op. That tensor represents the weights.

All this was just to give you a feel for what a graph and its tensors and ops are, giving you a better idea of what we mean by TensorFlow being a “software library for numerical computation using data flow graphs”. But why would we want to create these graphs?
Why Create Graphs?

The API that’s currently stable is one for Python, an interpreted language. Neural networks are compute intensive and a large one could have thousands or even millions of weights. Computing by interpreting every step would take forever.

So we instead create a graph made up of tensors and ops, describing the layout of the neural network, all mathematical operations, and even initial values for variables. Only after we’ve created this graph do we then pass it to what TensorFlow calls a session. This is known as deferred execution. The session runs the graph using very efficient code. Not only that, but many of the operations, such as matrix multiplication, are ones that can be done on a supported GPU (Graphics Processing Unit) and the session will do that for you. Also, TensorFlow is built to be able to distribute the processing across multiple machines and/or GPUs. Giving it the complete graph allows it to do that.
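The idea of deferred execution can be illustrated without TensorFlow at all. In the toy sketch below, the names Node, Placeholder, Constant, Add, Mul and run are hypothetical ones of ours, not TensorFlow’s API. Building the graph only records what should be computed; nothing is evaluated until run is called with input values.

```python
class Node:
    """A step in the graph; building one computes nothing."""


class Placeholder(Node):
    def __init__(self, name):
        self.name = name


class Constant(Node):
    def __init__(self, value):
        self.value = value


class Add(Node):
    def __init__(self, left, right):
        self.left, self.right = left, right


class Mul(Node):
    def __init__(self, left, right):
        self.left, self.right = left, right


def run(node, feed):
    """Evaluate the graph rooted at node, reading placeholders from feed."""
    if isinstance(node, Placeholder):
        return feed[node.name]
    if isinstance(node, Constant):
        return node.value
    if isinstance(node, Add):
        return run(node.left, feed) + run(node.right, feed)
    if isinstance(node, Mul):
        return run(node.left, feed) * run(node.right, feed)
    raise TypeError("unknown node type")


# Describe y = x * 3 + 1 first ...
x = Placeholder("x")
y = Add(Mul(x, Constant(3)), Constant(1))

# ... and only compute it later, possibly many times with different inputs.
print(run(y, {"x": 2}))   # 7
print(run(y, {"x": 10}))  # 31
```

Because the whole description exists before anything runs, whatever does the running is free to pick an efficient way to do it, which is the point the session makes use of.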
Creating The Binary Counter Graph

And here’s the code for our binary counter neural network. You can find the full source code on this GitHub page. Note that there’s additional code in it for saving information for use with TensorBoard.

We’ll start with the code for creating the graph of tensors and ops.

import tensorflow as tf
sess = tf.InteractiveSession()

NUM_INPUTS = 3
NUM_HIDDEN = 5
NUM_OUTPUTS = 3
We first import the tensorflow module, create a session for use later, and, to make our code more understandable, we create a few variables containing the number of units in our network.

x = tf.placeholder(tf.float32, shape=[None, NUM_INPUTS], name='x')
y_ = tf.placeholder(tf.float32, shape=[None, NUM_OUTPUTS], name='y_')

Then we create placeholders for our input and output units. A placeholder is a TensorFlow op for things that we’ll provide values for later. x and y_ are now tensors in a new graph and each has a placeholder op associated with it.

You might wonder why we define the shapes as [None, NUM_INPUTS] and [None, NUM_OUTPUTS], two dimensional lists, and why None for the first dimension? In the overview of neural networks above it looks like we’ll give it one input at a time and train it to produce a given output. It’s more efficient though, if we give it multiple input/output pairs at a time, what’s called a batch. The first dimension is for the number of input/output pairs in each batch. We won’t know how many are in a batch until we actually give one later. And in fact, we’re using the same graph for training, testing, and for actual usage so the batch size won’t always be the same. So we use the Python placeholder object None for the size of the first dimension for now.

W_fc1 = tf.truncated_normal([NUM_INPUTS, NUM_HIDDEN], mean=0.5, stddev=0.707)
W_fc1 = tf.Variable(W_fc1, name='W_fc1')

b_fc1 = tf.truncated_normal([NUM_HIDDEN], mean=0.5, stddev=0.707)
b_fc1 = tf.Variable(b_fc1, name='b_fc1')

h_fc1 = tf.nn.relu(tf.matmul(x, W_fc1) + b_fc1)

That’s followed by creating layer one of the neural network graph: the weights W_fc1, the biases b_fc1, and the hidden units h_fc1. The “fc” is a convention meaning “fully connected”, since the weights connect every input unit to every hidden unit.

tf.truncated_normal results in a number of ops and tensors which will later assign random numbers, drawn from a truncated normal distribution, to all the weights.

The Variable ops are given a value to do initialization with, random numbers in this case, and keep their data across multiple runs. They’re also handy for saving the neural network to a file, something you’ll want to do once it’s trained.

You can see where we’ll be doing the matrix multiplication using the matmul op. We also insert an add op which will add on the bias weights. The relu op performs what we call an activation function. The matrix multiplication and the addition are linear operations. There’s a very limited number of things a neural network can learn using just linear operations. The activation function provides some non-linearity. In the case of the relu activation function, it sets any values that are less than zero to zero, and all other values are left unchanged. Believe it or not, doing that opens up a whole other world of things that can be learned.
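As a rough illustration of what the relu op computes, the pure-Python function below applies the same rule to a list of values. It’s a sketch of the math only, not how TensorFlow implements it.

```python
def relu(values):
    """Replace negatives with zero; leave everything else unchanged."""
    return [max(0.0, v) for v in values]


pre_activation = [-2.0, -0.5, 0.0, 0.5, 2.0]
print(relu(pre_activation))  # [0.0, 0.0, 0.0, 0.5, 2.0]
```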

W_fc2 = tf.truncated_normal([NUM_HIDDEN, NUM_OUTPUTS], mean=0.5, stddev=0.707)
W_fc2 = tf.Variable(W_fc2, name='W_fc2')

b_fc2 = tf.truncated_normal([NUM_OUTPUTS], mean=0.5, stddev=0.707)
b_fc2 = tf.Variable(b_fc2, name='b_fc2')

y = tf.matmul(h_fc1, W_fc2) + b_fc2

The weights and biases for layer two are set up the same as for layer one but the output layer is different. We again will do a matrix multiplication, this time multiplying the weights and the hidden units, and then adding the bias weights. We’ve left the activation function for the next bit of code.

results = tf.sigmoid(y, name='results')

cross_entropy = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(logits=y, labels=y_))

Sigmoid is another activation function, like the relu we encountered above, there to provide non-linearity. I used sigmoid here partly because the sigmoid equation results in values between 0 and 1, ideal for our binary counter example. I also used it because it’s good for outputs where more than one output unit can have a large value. In our case, to represent the binary number 111, all the output units can have large values. When doing image classification we’d want something quite different, we’d want just one output unit to fire with a large value. For example, we’d want the output unit representing giraffes to have a large value if an image contains a giraffe. Something like softmax would be a good choice for image classification.
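The difference between the two is easy to see with a few numbers. The pure-Python sketch below is our own illustration of the math, not TensorFlow code. It feeds the same three large values to both functions: sigmoid lets all three outputs be high at once, while softmax makes them compete for a total of 1.

```python
import math


def sigmoid(values):
    """Squash each value independently into the range (0, 1)."""
    return [1.0 / (1.0 + math.exp(-v)) for v in values]


def softmax(values):
    """Turn values into probabilities that compete and sum to 1."""
    shifted = [v - max(values) for v in values]  # for numerical stability
    exps = [math.exp(v) for v in shifted]
    total = sum(exps)
    return [e / total for e in exps]


logits = [4.0, 4.0, 4.0]  # three strongly "on" outputs, like binary 111

print([round(v, 3) for v in sigmoid(logits)])  # [0.982, 0.982, 0.982]
print([round(v, 3) for v in softmax(logits)])  # [0.333, 0.333, 0.333]
```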

On close inspection, it looks like there’s some duplication. We seem to be inserting sigmoid twice. We’re actually creating two different, parallel outputs here. The cross_entropy tensor will be used during training of the neural network. The results tensor will be used when we run our trained neural network later for whatever purpose it’s created, for fun in our case. I don’t know if this is the best way of doing this, but it’s the way I came up with.
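To see why the training output takes y directly rather than the results tensor, here’s a pure-Python sketch of the per-element math. naive_loss applies sigmoid and then cross-entropy, while loss_from_logit uses the numerically stable form given in the TensorFlow documentation for sigmoid_cross_entropy_with_logits. Both function names are ours, and this shows the formula only, not TensorFlow’s actual code.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def naive_loss(logit, label):
    """Cross-entropy computed on the sigmoid output."""
    p = sigmoid(logit)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))


def loss_from_logit(logit, label):
    """Same quantity computed directly from the logit in a stable form."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


for logit, label in [(2.0, 1.0), (2.0, 0.0), (-3.0, 1.0), (0.0, 0.0)]:
    a = naive_loss(logit, label)
    b = loss_from_logit(logit, label)
    print(f"logit={logit:+.1f} label={label:.0f} naive={a:.6f} stable={b:.6f}")
```

The two columns agree, so the sigmoid really is applied in both outputs; the training path just folds it into the loss.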

train_step = tf.train.RMSPropOptimizer(0.25, momentum=0.5).minimize(cross_entropy)

The last piece we add to our graph is the training. This is the op or ops that will adjust all the weights based on training data. Remember, we’re still just creating a graph here. The actual training will happen later when we run the graph.

There are a few optimizers to choose from. I chose tf.train.RMSPropOptimizer because, like the sigmoid, it works well for cases where all output values can be large. For classifying things as when doing image classification, tf.train.GradientDescentOptimizer might be better.
Training And Using The Binary Counter

Having created the graph, it’s time to do the training. Once it’s trained, we can then use it.

inputvals = [[0, 0, 0], [0, 0, 1], [0, 1, 0], [0, 1, 1], [1, 0, 0], [1, 0, 1],
             [1, 1, 0], [1, 1, 1]]
targetvals = [[0, 0, 1], [0, 1, 0], [0, 1, 1], [1, 0, 0], [1, 0, 1], [1, 1, 0],
              [1, 1, 1], [0, 0, 0]]

First, we have some training data: inputvals and targetvals. inputvals contains the inputs, and for each one there’s a corresponding targetvals target value. For inputvals[0] we have [0, 0, 0], and the expected output is targetvals[0], which is [0, 0, 1], and so on.
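With only eight pairs it’s easy to type the lists by hand, but they can also be generated. The sketch below is an alternative of ours, not part of the original code; NUM_BITS and to_bits are names we made up for it. It builds the same two lists for a three-bit counter.

```python
NUM_BITS = 3


def to_bits(n, width=NUM_BITS):
    """Return n as a list of binary digits, most significant first."""
    return [(n >> shift) & 1 for shift in range(width - 1, -1, -1)]


count = 2 ** NUM_BITS
inputvals = [to_bits(n) for n in range(count)]
# Each target is the next number, wrapping from 111 back to 000.
targetvals = [to_bits((n + 1) % count) for n in range(count)]

print(inputvals[0], "->", targetvals[0])  # [0, 0, 0] -> [0, 0, 1]
print(inputvals[7], "->", targetvals[7])  # [1, 1, 1] -> [0, 0, 0]
```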

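if do_training == 1:
    # Initialize all Variable tensors before training
    sess.run(tf.global_variables_initializer())

    for i in range(10001):
        if i%100 == 0:
            train_error = cross_entropy.eval(feed_dict={x: inputvals, y_:targetvals})
            print("step %d, training error %g"%(i, train_error))
            if train_error < 0.0005:
                break

        # Run one training step on the full set of input/target pairs
        sess.run(train_step, feed_dict={x: inputvals, y_: targetvals})

    if save_trained == 1:
        print("Saving neural network to %s.*"%(save_file))
        saver = tf.train.Saver()
        saver.save(sess, save_file)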

do_training and save_trained can be hardcoded, and changed for each use, or can be set using command line arguments.

We first go through all those Variable ops and have them initialize their tensors.

Then, up to 10001 times, we run the graph from the bottom up to the train_step tensor, the last thing we added to our graph. We pass inputvals and targetvals to train_step’s op or ops, which we’d added using RMSPropOptimizer. This is the step that adjusts all the weights such that the given inputs will result in something close to the corresponding target outputs. If the error between target outputs and actual outputs gets small enough sooner, then we break out of the loop.

If you have thousands of input/output pairs then you could give it a subset of them at a time, the batch we spoke of earlier. But here we have only eight, and so we give all of them each time.

If we want to, we can also save the network to a file. Once it’s trained well, we don’t need to train it again.

else: # if we’re not training then we must be loading from file
    print("Loading neural network from %s"%(save_file))
    saver = tf.train.Saver()
    saver.restore(sess, save_file)
    # Note: the restore both loads and initializes the variables

If we’re not training it then we instead load the trained network from a file. The file contains only the values for the tensors that have Variable ops. It doesn’t contain the structure of the graph. So even when running an already trained graph, we still need the code to create the graph. There is a way to save and load graphs from files using MetaGraphs but we’re not doing that here.

print('\nCounting starting with: 0 0 0')
res = sess.run(results, feed_dict={x: [[0, 0, 0]]})
print('%g %g %g'%(res[0][0], res[0][1], res[0][2]))
for i in range(8):
    # Feed the previous output back in as the next input
    res = sess.run(results, feed_dict={x: res})
    print('%g %g %g'%(res[0][0], res[0][1], res[0][2]))

In either case we try it out. Notice that we’re running it from the bottom of the graph up to the results tensor we’d talked about above, the duplicate output we’d created especially for when making use of the trained network.

We give it 000, and hope that it returns something close to 001. We pass what was returned, back in and run it again. Altogether we run it 9 times, enough times to count from 000 to 111 and then back to 000 again.
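The feedback loop itself can be tried without a trained network. In the sketch below, fake_network is a hypothetical stand-in of ours that returns values near 0 and 1 the way a well-trained network would; it isn’t the real model. The loop mirrors the one above: each output goes straight back in as the next input.

```python
def fake_network(batch):
    """Hypothetical stand-in for the trained network: returns near-0 and
    near-1 floats for the next binary number. Not the real model."""
    out = []
    for bits in batch:
        n = int("".join(str(round(b)) for b in bits), 2)
        nxt = (n + 1) % 8
        out.append([0.99 if (nxt >> s) & 1 else 0.01 for s in (2, 1, 0)])
    return out


res = fake_network([[0, 0, 0]])
seen = [[round(v) for v in res[0]]]
for _ in range(8):
    res = fake_network(res)  # feed the previous output straight back in
    seen.append([round(v) for v in res[0]])

print(seen)
```

Note that the outputs fed back in are not exact 0s and 1s, so the real network has to tolerate inputs that are only close to the values it was trained on.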
Running the binary counter

Here’s the output during successful training and subsequent counting. Notice that it trained within 200 steps through the loop. Very occasionally it does all 10001 steps without reducing the training error sufficiently, but once you’ve trained it successfully and saved it, that doesn’t matter.
The Next Step

As we said, the code for the binary counter neural network is on our GitHub page. You can start with that, start from scratch, or use any of the many tutorials on the TensorFlow website. Getting it to do something with hardware is definitely my next step, taking inspiration from this robot that [Lukas Biewald] made recognize objects around his workshop.

What are you using, or planning to use TensorFlow for? Let us know in the comments below and maybe we’ll give it a try in a future article!

Numerai deep learning

I used Python to code the convolutional neural network. The code was taken from [6] and modified to work with this dataset. The convolutional neural network was built using the Theano library in Python [1]. This implementation simplifies the model because it does not implement location-specific gain and bias parameters, and it pools by maximum rather than by average. The LeNet5 model uses logistic regression for image classification. The convolutional neural network was trained by passing it the compressed train, test and validation datasets. There is one bias per output feature map. The feature maps are convolved with filters, and each feature map is individually downsampled using max-pooling. The compressed dataset is divided into small batches to reduce the overhead of computing and copying data for each individual image. The batch size for this model is set to 500. The learning rate, which is the step factor for stochastic gradient descent, is set to 0.1. The maximum number of epochs for running the optimizer is 200, which means that learning for each label continues for up to 200 epochs to optimize the network. When the first convolutional pooling layer is constructed, filtering reduces the image size to 24×24, which max-pooling further reduces to 12×12. During the construction of the second convolutional pooling layer, filtering reduces the image size to 8×8 and max-pooling reduces it further to 4×4. Since the hidden layer is fully connected, it operates on 2D matrices of rasterized images. This generates a matrix of shape (500, 800) with default values. The values of the fully connected hidden layer are classified using logistic regression. The cost minimized during training is the negative log likelihood of the model. A Theano function [2], the test model, is constructed to count the misclassifications made by the model.
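The layer sizes quoted above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the 28×28 input size, 5×5 filters, 2×2 pooling and 50 feature maps in the second layer are assumptions that are consistent with the numbers in the text but are not stated in it, and the function names are hypothetical.

```python
def conv_output_size(size, filter_size):
    """Side length after a 'valid' convolution with stride 1."""
    return size - filter_size + 1


def pool_output_size(size, pool_size):
    """Side length after non-overlapping max-pooling."""
    return size // pool_size


# Assumed values, not stated in the text: 28x28 inputs, 5x5 filters,
# 2x2 pooling and 50 feature maps in the second layer.
IMAGE_SIZE, FILTER_SIZE, POOL_SIZE, SECOND_LAYER_MAPS = 28, 5, 2, 50
BATCH_SIZE = 500  # stated in the text

conv1 = conv_output_size(IMAGE_SIZE, FILTER_SIZE)   # 24
pool1 = pool_output_size(conv1, POOL_SIZE)          # 12
conv2 = conv_output_size(pool1, FILTER_SIZE)        # 8
pool2 = pool_output_size(conv2, POOL_SIZE)          # 4

# Flattening every feature map gives one row per image in the batch.
hidden_input_shape = (BATCH_SIZE, SECOND_LAYER_MAPS * pool2 * pool2)
print(conv1, pool1, conv2, pool2, hidden_input_shape)
```

Under these assumptions the flattened input to the hidden layer has 50 × 4 × 4 = 800 columns, matching the (500, 800) shape given above.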
We create two lists: one of all model parameters that have to be fit by gradient descent, and the other of the gradients of all model parameters. The model parameters are updated by stochastic gradient descent (SGD) in the train model, which is a Theano function. Manually creating an update rule for each model parameter would be tedious because the model has many parameters. The updates list is therefore created automatically by looping over all (parameter, gradient) pairs. We keep an improvement threshold, meaning that only a relative improvement of at least this value is considered significant. Once the training of the convolutional neural network is done, the train model is returned.
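The looping described above can be sketched in plain Python. This is an illustration of the update rule only, using scalar stand-ins for what are arrays in the real model; the function name sgd_updates and the sample values are hypothetical, and it does not use Theano.

```python
def sgd_updates(params, grads, learning_rate):
    """Pair each parameter with its gradient and return the updated values."""
    return [p - learning_rate * g for p, g in zip(params, grads)]


# Hypothetical scalar parameters and gradients; a real network holds arrays.
params = [0.5, -1.0, 2.0]
grads = [1.0, -2.0, 0.0]

new_params = sgd_updates(params, grads, learning_rate=0.1)
print(new_params)
```

Each parameter moves against its gradient by a step scaled by the learning rate, and a parameter whose gradient is zero is left unchanged.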

Deep learning refers to a class of machine learning techniques in which many layers of information processing stages in hierarchical architectures are exploited for pattern classification and for feature or representation learning [10]. It lies at the intersection of several research areas, including neural networks, graphical modeling, optimization, pattern recognition and signal processing [5]. Yann LeCun adopted the deep supervised backpropagation convolutional network for digit recognition. In the recent past, deep learning has become a valuable research topic in both computer vision and machine learning, where it achieves state-of-the-art results for a variety of tasks. The deep convolutional neural network (CNN) proposed by Hinton and colleagues in "ImageNet Classification with Deep Convolutional Neural Networks" came first in the ImageNet image classification task. The model was trained on more than one million images and achieved a winning top-5 test error rate of 15.3% over 1,000 classes. After that, some recent works obtained better results by improving CNN models. The top-5 test error rate was later reduced to 13.24% by training the model to simultaneously classify, locate and detect objects. Besides image classification, the object detection task can also benefit from the CNN model. Generally speaking, three important reasons for the popularity of deep learning today are drastically increased chip processing abilities (e.g., GPU units), the significantly lower cost of computing hardware, and recent advances in machine learning and signal/information processing research. Over the past several years, a rich family of deep learning techniques has been proposed and extensively studied, e.g., Deep Belief Network (DBN), Boltzmann Machines (BM), Restricted Boltzmann Machines (RBM), Deep Boltzmann Machine (DBM) and Deep Neural Networks (DNN).
Among these techniques, the deep convolutional neural network, which is a discriminative deep architecture and belongs to the DNN category, has achieved state-of-the-art performance on various tasks and competitions in computer vision and image recognition. Specifically, the CNN model consists of several convolutional layers and pooling layers, which are stacked one on top of another. The convolutional layer shares many weights, and the pooling layer sub-samples the output of the convolutional layer and reduces the data rate from the layer below. The weight sharing in the convolutional layer, together with appropriately chosen pooling schemes, endows the CNN with some invariance properties (e.g., translation invariance). My work is similar to that of Ji Wan et al. [10] but differs in that the dataset I am using is different from the ones used in their study. My approach to image matching is also novel and has not been used in any study similar to mine.

Numerai deep learning example

Research on CBIR over the past decade has touched many aspects of the problem, and deep learning was seen to give the best results. The problem of annotated images has also been addressed, but not in combination with deep learning. In this thesis I propose to show better results for annotated images by using not only the images but also the annotations provided with each image. I will be using a convolutional neural network.

I have previously worked on CBIR using a Bag-of-Words model with the same dataset. The results of this study are evaluated against the results of that earlier study and are discussed in chapter 5. The work is also compared against the results of Ji Wan et al. [10], who used deep learning for CBIR on several datasets; their datasets, however, consisted of plain images without any annotations.

“Convolutional neural network (CNN) is a type of feed-forward artificial neural network where the individual neurons are tiled in such a way that they respond to overlapping regions in the visual field” [11]. CNNs are biologically inspired variants of Multilayer Perceptrons (MLP) that are designed to require minimal preprocessing. These models are widely used in image and video recognition. When used for image recognition, a CNN consists of multiple layers of small neuron collections, each of which looks at a small portion of the input image called a receptive field [11]. The outputs of these collections are tiled so that they overlap, which yields a better representation of the original image; every layer repeats this process. This tiling is the reason CNNs are able to tolerate translation of the input image. Convolutional networks may include local or global pooling layers, which combine the outputs of neuron clusters. They also contain various combinations of convolutional and fully connected layers, with a point-wise non-linearity applied at the end of or after each layer [11]. The convolution operation is applied to small regions of the input to avoid the billions of parameters that would exist if all layers were fully connected. Convolutional networks use shared weights in the convolutional layers, i.e. the same filter (weights bank) is used for each pixel in the layer, which reduces the required memory size and improves performance. CNNs also need relatively little pre-processing compared with other image classification algorithms.

CNNs enforce a local connectivity pattern between neurons of adjacent layers to exploit spatially local correlation [6]. As illustrated in Fig. 4.1, the inputs of the hidden units in layer m come from a subset of units in layer m-1, namely units with spatially adjoining receptive fields.

Every filter h_i in a CNN is replicated across the entire visual field. The replicated units share the same parameters, i.e. weights and bias, and form a feature map. Fig. 4.2 shows 3 hidden units belonging to the same feature map; weights of the same color are shared, that is, constrained to be identical [6]. Such shared parameters can still be learned by gradient descent with only a small change to the original algorithm: the gradient of a shared weight is the sum of the gradients of the parameters being shared. Replicating units in this way allows features to be detected regardless of their position in the visual field. Weight sharing also increases learning efficiency by greatly reducing the number of free parameters being learnt. These constraints on the model let CNNs achieve better generalization on vision problems.

We obtain a feature map by repeatedly applying a function across sub-regions of the entire image, in other words by convolving the input image with a linear filter, adding a bias term and then applying a non-linear function [6]. If we denote the k-th feature map at a given layer as $h^k$, whose filters are determined by the weights $W^k$ and the bias $b_k$, then the feature map is obtained by the following equation, where $x$ is the input and $f$ is the non-linear function:

\[ h^k_{ij} = f\big( (W^k * x)_{ij} + b_k \big) \]
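
As a rough illustration of this computation, the NumPy sketch below slides one filter over a single-channel image, adds a bias and applies a non-linearity. The function name `feature_map`, the choice of tanh, and the toy input are assumptions made for the example. Like most CNN code, it computes a cross-correlation (the filter is not flipped) and keeps only the positions where the filter fits entirely inside the image.

```python
import numpy as np

def feature_map(x, W, b, nonlinearity=np.tanh):
    """Apply filter W at every valid position of image x, add bias b, apply the non-linearity."""
    fh, fw = W.shape
    out_h = x.shape[0] - fh + 1
    out_w = x.shape[1] - fw + 1
    h = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Weighted sum over the receptive field whose top-left corner is (i, j).
            h[i, j] = np.sum(x[i:i + fh, j:j + fw] * W) + b
    return nonlinearity(h)

# Toy 4x4 image and a 2x2 filter that subtracts the lower-right pixel from the upper-left one.
x = np.arange(16, dtype=float).reshape(4, 4)
W = np.array([[1.0, 0.0], [0.0, -1.0]])
h = feature_map(x, W, b=0.0)
```

The same four weights in `W` are reused at all nine output positions, which is the weight sharing described above.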

The figure depicts 2 layers of a CNN. There are 4 feature maps in layer m-1 and 2 feature maps in hidden layer m ($h^0$ and $h^1$). The pixels in the feature maps $h^0$ and $h^1$ (blue and red squares) are computed from the pixels of layer (m-1) that lie within their 2×2 receptive field in the layer below (colored squares). Note that the receptive field spans all 4 input feature maps. As a result, the weights $W^0$ and $W^1$ of $h^0$ and $h^1$ are 3D weight tensors: the leading dimension indexes the input feature maps, whereas the other two refer to the pixel coordinates. Putting it all together, as shown in Fig. 4.3, $W^{kl}_{ij}$ denotes the weight that connects each pixel of the k-th feature map at layer m with the pixel at coordinates (i,j) of the l-th feature map at layer (m-1) [6].

Max-pooling, a form of non-linear down-sampling, is an important concept in CNNs. The input image is partitioned into a set of non-overlapping rectangles, and the maximum value is output for each such sub-region. We use max-pooling in vision for the following reasons:
1) It reduces the computation for upper layers by eliminating non-maximal values.
2) It provides a form of translation invariance. Suppose a max-pooling layer is cascaded with a convolutional layer. The input image can be translated by a single pixel in 8 directions. If max-pooling is done over a 2×2 region, 3 out of these 8 possible configurations produce exactly the same output at the convolutional layer. For max-pooling over a 3×3 region, this jumps to 5/8 [6].
Because it provides additional robustness to position, max-pooling is a sensible way of reducing the dimensionality of intermediate representations.
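
The partition-and-take-the-maximum step can be sketched in a few lines of NumPy. The function name `max_pool` and the toy input are illustrative assumptions; rows and columns that do not fill a complete window are simply dropped here, which is one of several possible conventions.

```python
import numpy as np

def max_pool(x, size=2):
    """Non-overlapping size x size max-pooling of a 2D array."""
    h, w = x.shape[0] // size, x.shape[1] // size
    # Reshape so that each pooling window gets its own pair of axes (1 and 3).
    blocks = x[:h * size, :w * size].reshape(h, size, w, size)
    return blocks.max(axis=(1, 3))

x = np.array([[1, 2, 5, 6],
              [3, 4, 7, 8],
              [9, 1, 0, 2],
              [1, 1, 3, 4]])
pooled = max_pool(x)  # one maximum per 2x2 window
```

A 4×4 input becomes a 2×2 output, so the next layer sees a quarter of the values.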

The dataset I chose for this thesis is from the SUN database [12]. The major reason for choosing this dataset was that its images are pre-annotated, with the annotations for each image stored in an XML file. The SUN database is huge, so I had to choose a small subset of it for this study. I classify images into 8 classes, namely water, car, mountain, ground, tree, building, snow and sky, plus an additional class, unknown, which contains all the remaining classes. I chose only those sets of images which I felt were most relevant to these classes, and collected a database of 3000 images from 41 categories. Each image has its annotations in an XML file. I randomly divided the dataset into an 80% training set and a 20% testing set, and then further divided the training set into 80% training and 20% validation. The final split has 1900 training images, 500 validation images and 600 testing images. The major drawback of this dataset is that the images are annotated by humans and the annotations are not perfect, which may have some effect on the results. I try to handle this problem by collecting as many synonyms as I can for each class label. A few examples of the synonyms that belong to the class label water are lake, lake water, sea water, river water, wave, ripple, river and sea, among others. I mapped these synonyms to the respective class labels being used. Not all images in every category were annotated, so I selected the annotated images from the dataset and used only those for this study. Fig. 4.5 shows an example of an image from the dataset and its annotation file, where it can be seen how a river is annotated by the user.

Data Science Courses

Over the last decade, a great deal of scientific effort has been invested in developing methods for efficient and effective detection of botnets. As a result, various detection methods based on diverse technical principles and on various aspects of the botnet phenomenon have been defined. Because of its promise of non-invasive and resilient detection, botnet detection based on network traffic analysis has drawn special attention from the research community. Furthermore, many authors have turned to machine learning algorithms as a means of inferring botnet-related knowledge from the monitored traffic. This paper presents a review of contemporary botnet detection methods that use machine learning as a tool for identifying botnet-related traffic. The main goal of the paper is to provide a comprehensive overview of the field by summarizing current scientific efforts. The contribution of the paper is threefold. First, the paper provides detailed insight into the existing detection methods by investigating which bot-related heuristics were assumed by the detection systems and how different machine learning techniques were adapted in order to capture botnet-related knowledge. Second, the paper compares the existing detection methods by outlining their characteristics, performance, and limitations. Special attention is given to the practice of experimenting with the methods and to the methodologies of performance evaluation. Third, the study indicates limitations and challenges of using machine learning for identifying botnet traffic and outlines possibilities for the future development of machine learning-based botnet detection systems.



Machine Learning.

Supervised versus unsupervised learning

Machine learning is a branch of artificial intelligence that uses algorithms to, for example, find patterns in data and make predictions about future events. In machine learning, a dataset consists of observations called instances, each described by a number of variables called attributes. Supervised learning is the modeling of datasets containing labeled instances. In supervised learning, each instance can be represented as (x, y), where x is a set of independent attributes (these can be discrete or continuous) and y is the dependent target attribute.

Table 3.1: An example of a supervised learning dataset

Time   x1  x2  x3     x4      x5      x6      x7     y
09:30  b   n   -0.06  -116.9  -21.7   28.6    0.209  up
09:31  b   b   0.06   -85.2   -61     -21.7   0.261  unchanged
09:32  b   b   0.26   -4.4    -114.7  -61     0.17   down
09:33  n   b   0.11   -112.7  -132.5  -114.7  0.089  unchanged
09:34  n   n   0.08   -128.5  -101.3  -132.5  0.328  down

The target attribute y can also be either continuous or discrete: the modeling task is regression if the target is continuous, and classification if the target is discrete (in which case it is also called a class label). Table 3.1 shows a dataset for supervised learning with seven independent attributes x1, x2, ..., x7 and one dependent target attribute y. More specifically, x1, x2 ∈ {b, n}, x3, ..., x7 ∈ R, and the target attribute y ∈ {up, unchanged, down}. The attribute time is used to identify an instance and is not used in the model. The training and test datasets are represented in the same way; however, whereas the training set contains vectors with known label (y) values, the labels for the test set are unknown.

In unsupervised learning the dataset does not include a target attribute, or a known outcome. Since the class values are not determined a priori, the purpose of this learning technique is to find similarity among groups, or some intrinsic clusters, within the data. A very simple two-dimensional (two-attribute) demonstration is shown in Figure 3.1, with the data partitioned into five clusters.

Figure 3.1: An example of an unsupervised learning technique (clustering)

A case could be made, however, that the data should be partitioned into two clusters, or three, and so on; the “correct” answer depends on prior knowledge or biases associated with the dataset, which determine the level of similarity required for the underlying problem. In theory we can have as many clusters as data instances, although that would defeat the purpose of clustering.

Depending on the problem and the data available, the algorithm required can be either a supervised or an unsupervised technique. In this thesis, the goal is to predict the future price direction of the streaming stock dataset. Since the future direction becomes known after each instance, the training set is constantly expanding with labeled data as time passes. This requires a supervised learning technique. Additionally, we explore the use of different algorithms, since some may be better than others depending on the underlying data. Care should be taken to avoid the trap of “when all you have is a hammer, everything becomes a nail.”

3.3 Supervised learning algorithms

3.3.1 k Nearest-neighbor

The k nearest neighbor (kNN) method is one of the simplest machine learning methods and is often referred to as a lazy learner because learning is not carried out until actual classification or prediction is required. It assigns the most frequent class among the k closest training examples in the feature space, as measured by the weighted Euclidean distance (or some other distance measure). In specific problems such as text classification, kNN has been shown to work as well as more complicated models [240]. When nominal attributes are present, it is generally advised to define a “distance” between the different values of the attributes [236]. For our dataset, this could apply to the different trading days: Monday, Tuesday, Wednesday, Thursday, and Friday.
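
A minimal kNN sketch, using only the Python standard library, shows why the method is called lazy: there is no training step, and all work happens when a query arrives. The function name `knn_predict` and the toy data are assumptions for illustration; the sketch uses plain (unweighted) Euclidean distance and a simple vote, and requires Python 3.8 or later for `math.dist`.

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """train is a list of (features, label); return the most frequent label among the k nearest."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical two-attribute training instances.
train = [((0.0, 0.0), "down"), ((0.1, 0.2), "down"),
         ((1.0, 1.0), "up"), ((0.9, 1.1), "up"), ((1.2, 0.8), "up")]
prediction = knn_predict(train, (0.2, 0.1), k=3)
```

Every prediction sorts the whole training set by distance, so the cost of classification grows with both the number of training instances and the number of attributes.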

A downside of this model is its slow classification time; however, speed can be increased by using dimensionality reduction algorithms, for example reducing the number of attributes from 200 to 20. Since learning is not carried out until the classification phase, though, this is an unsuitable algorithm to use when decisions are needed quickly.

3.3.2 Naïve Bayes

The naïve Bayes classifier is an efficient probabilistic model based on Bayes' theorem that examines the likelihood of features appearing in the predicted classes. Given a set of attributes X = {x1, x2, ..., xn}, the objective is to construct the posterior probability of the class Cj among a set of possible class outcomes C = {C1, C2, ..., Ck}. With Bayes' rule,

\[ P(C_j \mid x_1, \ldots, x_n) \propto P(C_j)\, P(x_1, \ldots, x_n \mid C_j), \]

where P(x1, ..., xn | Cj) is the likelihood of observing the attributes X given class Cj. Assuming independence of the attributes (this assumption is the naïve aspect of the algorithm), we can rewrite this as

\[ P(C_j \mid X) \propto P(C_j) \prod_{i=1}^{n} P(x_i \mid C_j). \]

A new instance with a set of attributes X is labeled with the class Cj that achieves the highest posterior probability.

3.3.3 Decision table

A decision table classifier is built on the conceptual idea of a lookup table. The classifier returns the majority class of the training set if the decision table (lookup table) cell matching the new instance is empty. In certain datasets, classification performance has been found to be higher when using decision tables than with more complicated models. A further description can be found in [124, 125, 127].

3.3.4 Support Vector Machines

Support vector machines [221] have long been recognized as being able to efficiently handle high-dimensional data. Originally designed as a two-class classifier, the SVM can work with more classes by making multiple binary classifications (one-versus-one between every pair of classes).
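
Returning briefly to the naïve Bayes rule of Section 3.3.2, the product of per-attribute likelihoods can be sketched for nominal attributes as below. The function names are hypothetical, the five training rows reuse the x1, x2 and y columns of Table 3.1, and no smoothing is applied, so an attribute value never seen with a class gives that class a score of zero.

```python
from collections import Counter, defaultdict

def train_nb(instances):
    """Count class frequencies and, per (attribute index, class), attribute-value frequencies."""
    class_counts = Counter(label for _, label in instances)
    value_counts = defaultdict(Counter)
    for attrs, label in instances:
        for i, value in enumerate(attrs):
            value_counts[(i, label)][value] += 1
    return class_counts, value_counts

def predict_nb(model, attrs):
    """Return the class with the highest unnormalized posterior, plus all scores."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total                                   # prior P(Cj)
        for i, value in enumerate(attrs):
            score *= value_counts[(i, label)][value] / n    # likelihood P(xi | Cj)
        scores[label] = score
    return max(scores, key=scores.get), scores

# (x1, x2) -> y, taken from the five rows of Table 3.1.
rows = [(("b", "n"), "up"), (("b", "b"), "unchanged"), (("b", "b"), "down"),
        (("n", "b"), "unchanged"), (("n", "n"), "down")]
label, scores = predict_nb(train_nb(rows), ("b", "b"))
```

The scores are proportional to the posteriors rather than equal to them, which is sufficient for picking the most probable class.
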
The SVM algorithm works by classifying instances based on a linear function of the features. Non-linear classification can additionally be performed using a kernel. The classifier is fed with pre-labeled instances and, by selecting points as support vectors, the SVM searches for a hyperplane that maximizes the margin. More information can be found in [221].

3.3.5 Artificial Neural Networks

An artificial neural network (ANN) is an interconnected group of nodes intended to represent the network of neurons in the brain. ANNs are widely used in the literature because of their ability to learn complex patterns. We present only a short overview of their structure in this section. The artificial neural network is comprised of nodes (shown as circles in Figure 3.2): an input layer represented as x1, ..., x6, an optional hidden layer, and an output layer y. The objective of the ANN is to determine a set of weights w (between the input, hidden, and output nodes) that minimize the total sum of squared errors. During training these weights wi are adjusted according to a learning parameter λ ∈ [0, 1] until the network outputs become consistent with the target outputs. Large values of λ may make changes to the weights that are too drastic, while values that are too small may require more iterations (called epochs) before the model sufficiently learns from the training data. The difficulty of using artificial neural networks is finding parameters that learn from the training data without overfitting (i.e. memorizing the training data) and therefore performing poorly on unseen data.

Figure 3.2: Example of a multilayer feed-forward artificial neural network (input layer x1, ..., x6; hidden layer; output layer y)

If there are too many hidden nodes, the system may overfit the current data, while if there are too few, the system may be unable to fit the input values properly. A stopping criterion also has to be chosen.
This can include stopping when the total error of the network falls below some predetermined error level or when a certain number of epochs (iterations) has been completed [16, 25, 177]. To demonstrate this, see Figure 3.3. This plot represents a segment of our high-frequency trade data that will be used later in this thesis. As the epochs increase (by tens), the number of incorrectly identified training instances decreases, as seen by the decrease in the training error. However, the validation error decreases until 30 epochs and then starts to increase. At roughly 80 epochs the validation error begins to decrease again; however, we need to make a judgment call, since increasing the number of epochs increases training time dramatically.

Figure 3.3: Artificial neural network classification error versus number of epochs

Yu et al. [245] state that with foreign exchange rate forecasting, which is similar to stocks because of the high degree of noise, volatility and complexity, it is advisable to use a sigmoidal-type transfer function (i.e. logistic or hyperbolic tangent). They base this on the large number of papers that find predictability using this type of function in the hidden layer.

3.3.6 Decision Trees

The decision tree is one of the more widely used classifiers in practice because the algorithm creates rules which are easy to understand and interpret. The version we use in this paper is also one of the most popular forms, C4.5 [186], which extends the ID3 [185] algorithm. The improvements are: 1) it is more robust to noise, 2) it allows for the use of continuous attributes, and 3) it works with missing data. C4.5 is a recursive divide-and-conquer algorithm that begins by selecting an attribute from the training set to place at the root node. Each value of the attribute creates a new branch, and the process repeats recursively using all the instances reaching that branch [236]. An ideal node contains all (or nearly all) instances of one class.
To determine the best attribute to choose for a particular node in the tree, the gain in information entropy for the decision is calculated. More information can be found in [186].

3.3.7 Ensembles

An ensemble is a collection of multiple base classifiers that takes a new example, passes it to each of its base classifiers, and then combines their predictions according to some method (such as voting). The motivation is that by combining the predictions, the ensemble is less likely to misclassify. For example, Figure 3.4a demonstrates an ensemble with 25 hypothetical classifiers, each with an independent error rate of 0.45 (assuming a uniform two-class problem). The number of incorrect classifier votes k follows a binomial distribution,

\[ P(k) = \binom{n}{k} p^{k} (1 - p)^{n - k}. \]

The probability that 13 or more are in error is 0.31, which is less than the error rate of an individual classifier. This potential advantage of using multiple models (ensembles) rests on the assumption that the individual classifier error rate is less than 0.50. If the independent classifier error rate is 0.55, then the probability of 13 or more in error is 0.69, and it would be better not to use an ensemble of classifiers. Figure 3.4b demonstrates the error rate of the ensemble for three independent error rates, 0.55, 0.50, and 0.45, for ensembles containing an odd number of classifiers, from 3 to 101 (the idea for the visualization came from [59, 82]). From the figure it can be seen that the smaller the independent classifier error rate and the larger the number of classifiers in the ensemble, the less likely it is that a majority of the classifiers will predict incorrectly [59, 82].

Figure 3.4: Ensemble simulation (error rate versus number of classifiers in the ensemble, employing majority voting, for three independent error rates)
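
The two probabilities quoted for the 25-classifier ensemble can be reproduced by summing the upper tail of the binomial distribution. The function name `majority_error` is an illustrative assumption; the calculation relies on the same independence assumption as the text and needs Python 3.8 or later for `math.comb`.

```python
from math import comb

def majority_error(n, p):
    """Probability that more than half of n independent classifiers, each with error rate p, are wrong."""
    k_min = n // 2 + 1  # smallest number of wrong votes that forms a majority
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

err_045 = majority_error(25, 0.45)  # rounds to 0.31
err_055 = majority_error(25, 0.55)  # rounds to 0.69
```

With an individual error rate of exactly 0.50 and an odd number of classifiers, the same function returns 0.50, which is the boundary case between an ensemble that helps and one that hurts.
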
The assumption of classifier independence may be unreasonable, given that the classifiers may predict in a similar manner because of the training set. Ideally, the base classifiers generate errors that are as uncorrelated as possible. Creating a diverse set of classifiers within the ensemble is considered an important property, since it decreases the likelihood that a majority of the base classifiers misclassify an instance. Two of the more popular methods used within ensemble learning are bagging [27] and boosting (of which the AdaBoost algorithm [78] is the most common). These methods promote diversity by building base classifiers on different subsets of the training data or with different weights.

Bagging

Bagging, also known as bootstrap aggregation, was proposed by Breiman in 1994 in an early version of [27]. It works by generating k bootstrapped training sets and building a classifier on each (where k is determined by the user). Each training set of size N is created by randomly selecting instances from the original dataset with replacement, with each instance receiving an equal probability of being selected. Since every instance has an equal probability of being selected, bagging does not focus on any particular instance of the training data and is therefore less likely to overfit [177]. Bagging is generally used for unstable classifiers (see footnote 3) such as decision trees and neural networks.

Boosting

The AdaBoost (Adaptive Boosting) algorithm of Freund and Schapire [78], introduced in 1995, is synonymous with boosting. The idea, however, was proposed in 1988 by Michael Kearns [114] in a class project, where he hypothesized that a “weak” classifier, performing slightly better than random guessing, could be “boosted” into a “strong” classifier. In boosting, instances being classified are assigned a weight; instances that were previously incorrectly classified receive larger weights, with the hope that subsequent models correct the mistakes of the previous model.
In the AdaBoost algorithm the original training set D has a weight w assigned to each of its N instances {(x1, y1), ..., (xN, yN)}, where xi is a vector of inputs and yi is the class label of that instance. With the weight added the instances become {(x1, y1, w1), ..., (xN, yN, wN)}, and the wi must sum to 1. The AdaBoost algorithm then builds k base classifiers, starting with an initial weight wi = 1/N. On each iteration of the algorithm (the number of iterations is determined by the user), the weight wi is adjusted according to the error εi of the classifier hypothesis (see footnote 4). The points that were incorrectly identified receive higher weights, and the ones that were correctly identified receive lower weights. The desire is that on the next iteration, the re-weighting will help to correctly classify the instances that were misclassified by the previous classifier. When implementing the boosting ensemble on test data, the final class is determined by a weighted vote of the classifiers [78, 149].

Boosting does more to reduce bias than variance. This reduction is due to the algorithm adjusting its weights to learn previously misclassified instances, thereby increasing the probability that these instances will be learned correctly in the future. However, boosting tends to perform poorly on noisy datasets: the weights of noisy instances become greater, which causes the model to focus on those instances and overfit the data [195].

Footnote 3: By unstable, it is meant that small changes in the training set can lead to large changes in the classifier outcome.

Combining classifiers for ensembles

The last step in any ensemble-based system is the method used to combine the individual classifiers; this is often referred to as fusion rules. Classifiers within an ensemble are most commonly combined using a majority voting algorithm.
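
As a small illustration of voting, the sketch below implements a plain plurality vote and a weighted vote of the kind used to combine boosted classifiers. The function names, the votes and the classifier weights are made-up values for the example, not results from this study.

```python
from collections import Counter, defaultdict

def plurality_vote(predictions):
    """Return the class predicted by the largest number of classifiers."""
    return Counter(predictions).most_common(1)[0][0]

def weighted_vote(predictions, weights):
    """Return the class whose supporting classifiers have the largest total weight."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)

plain = plurality_vote(["up", "down", "up", "flat", "down", "up"])
# One trusted classifier can outvote two less trusted ones.
weighted = weighted_vote(["up", "down", "down"], [0.6, 0.25, 0.15])
```

In the weighted case the outcome differs from a head count: two of the three classifiers say "down", but the single "up" vote carries more total weight.
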
Footnote 4: If the error is greater than what would be achieved by guessing the class, then the ensemble is returned to the previously generated base classifier.

There are, however, different methods of combining, which often depend on the underlying classifiers used. For example, the naïve Bayes algorithm provides continuous-valued outputs, allowing a wide range of strategies for combining, while an artificial neural network provides a discrete-valued output, allowing for fewer [133, 134, 247]. A description of each follows:

• Majority voting
  – Plurality majority voting: the class that receives the highest number of votes among classifiers (in the literature, majority voting typically refers to this version)
  – Simple majority voting: the class that receives one more than fifty percent of all votes among classifiers
  – Unanimous majority voting: the class that all the classifiers unanimously vote on
• Weighted majority voting: if the confidence in the classifiers is not equal, we can weight certain classifiers more heavily. This method is followed in the AdaBoost algorithm.
• Algebraic combiners
  – Mean/Minimum/Maximum/Median rules: the ensemble decision is chosen for the class according to the average/minimum/maximum/median of each classifier's confidence.

While ensembles have shown success in a variety of problems, there are some associated drawbacks. These include the added memory and computation cost of keeping multiple classifiers stored and ready to process. The loss of interpretability may also be a cause for concern, depending on the needs of the problem. For example, a single decision tree can be easily interpreted, while an ensemble of 100 decision trees could be difficult to interpret [21].

Table 3.2: Confusion matrix

                  Predicted +   Predicted –
Actual class +    TP            FN
Actual class –    FP            TN

3.4 Performance metrics

3.4.1 Confusion matrix and accuracy

A confusion matrix, also called a contingency table, is a visualization of the performance of a supervised learning method.
A problem with n classes requires a confusion matrix of size n × n, with the rows representing the actual class and the columns representing the classifier's predicted class. In a confusion matrix, TP (true positive) is the number of positives correctly identified, TN (true negative) is the number of negatives correctly identified, FP (false positive) is the number of negatives incorrectly identified as positive, and FN (false negative) is the number of positives incorrectly identified as negative. An example of a confusion matrix can be seen in Table 3.2.

From the confusion matrix it is relatively simple to arrive at different measures for comparing models. An example is accuracy, which is a widely used metric and is easy to interpret. From Equation 3.1, accuracy is the total number of correct predictions made over the total number of predictions made. While accuracy is a popular metric, it is not very descriptive when used to measure performance on a highly imbalanced dataset. A model may have a high level of accuracy but may not identify well the class that we are interested in predicting. For example, if attempting to identify large moves in a stock whose data is comprised of 99% small moves and 1% large moves, reporting that a model has an accuracy of 99% is of little value without additional information. A classifier could also reach 99% accuracy by simply reporting the class with the largest number of instances (e.g. the majority class, “small moves”). In an imbalanced dataset, a model may misidentify all positive instances and still have a high level of accuracy; pure randomness is not taken into account by the accuracy metric. Accuracy's complement is the error rate (1 − Accuracy), which can be seen in Equation 3.2.

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{3.1} \]

\[ \text{Error rate} = \frac{FP + FN}{TP + FP + TN + FN} \tag{3.2} \]

There are several approaches to comparing models with imbalanced datasets.
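
To see the accuracy pitfall numerically, the sketch below evaluates Equations 3.1 and 3.2 for a classifier that always predicts the majority class. The counts are hypothetical: 1,000 instances with the 99%/1% split used in the example above.

```python
def accuracy(tp, tn, fp, fn):
    """Equation 3.1: correct predictions over all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)

def error_rate(tp, tn, fp, fn):
    """Equation 3.2: incorrect predictions over all predictions."""
    return (fp + fn) / (tp + fp + tn + fn)

# 10 large moves (positive class) and 990 small moves (negative class).
# A classifier that always answers "small move" never finds a positive.
tp, fn = 0, 10
tn, fp = 990, 0
acc = accuracy(tp, tn, fp, fn)
err = error_rate(tp, tn, fp, fn)
missed_share = fn / (tp + fn)  # share of large moves that were missed
```

The accuracy is 0.99 even though every large move is missed, which is why the additional metrics are needed.
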
The first approach is the precision and recall metrics and their accompanying harmonic mean, the F-measure. The second is based on Cohen's kappa statistic, which takes into account the randomness of the class. The third is the receiver operating characteristic, which is based on the true positive and false positive rates. The fourth is a cost-based metric, which assigns specific “costs” to correctly and incorrectly identifying specific classes. The last approach is based not on the ability of the model to make correct decisions, but on the profitability of the classifier as it applies to a trading system. A more detailed description of these metrics follows.

3.4.2 Precision and recall

Precision and recall are both popular metrics for evaluating classifier performance and will be used extensively in this paper. Precision is the percentage of the model's positive predictions that are correct (Equation 3.3).

More specifically, precision is the number of correctly identified positive examples divided by the total number of examples that are classified as positive. Recall is the percentage of positives correctly identified out of all the existing positives (Equation 3.4); it is the number of correctly classified positive examples divided by the total number of actual positive examples in the test set. In our imbalanced example above, with 99% small moves and 1% large moves, precision would be how often a predicted large move was in fact a large move, while recall would be the number of large moves correctly identified out of all the large moves in the dataset.

\[ \text{Precision} = \frac{TP}{TP + FP} \tag{3.3} \]

\[ \text{Sensitivity (Recall)} = \frac{TP}{TP + FN} \tag{3.4} \]

\[ \text{Specificity} = \frac{TN}{TN + FP} \tag{3.5} \]

\[ \text{F-measure} = \frac{2\,(\text{precision})(\text{recall})}{\text{precision} + \text{recall}} \tag{3.6} \]

Precision and recall are often achieved at the expense of each other, i.e. high precision is achieved at the expense of recall and high recall is achieved at the expense of precision. An ideal model would have both high recall and high precision. The F-measure (also called the F-score and the F1-score in the literature), which can be seen in Equation 3.6, is the harmonic mean of precision and recall in a single measurement. The F-measure ranges from 0 to 1, with a measure of 1 being a classifier with perfect precision and recall.

3.4.3 Kappa

The second approach to comparing models on imbalanced datasets is based on Cohen's kappa statistic. This metric takes into consideration the randomness of the class and provides an intuitive result. From [14], the metric can be observed in Equation 3.7, where P0 is the total agreement probability and Pc is the agreement probability which is due to chance.

\[ \kappa = \frac{P_0 - P_c}{1 - P_c} \tag{3.7} \]

\[ P_0 = \sum_{i=1}^{I} P(x_{ii}) \tag{3.8} \]

\[ P_c = \sum_{i=1}^{I} P(x_{i\cdot})\, P(x_{\cdot i}) \tag{3.9} \]

The total agreement probability P0 (i.e. the classifier's accuracy) can be computed according to Equation 3.8, where I is the number of class values. The probability due to chance, Pc, can be computed according to Equation 3.9, where $P(x_{i\cdot})$ is the row marginal probability and $P(x_{\cdot i})$ is the column marginal probability, both obtained from the confusion matrix. The kappa statistic is constrained to the interval [−1, 1], with κ = 0 meaning that agreement is equal to random chance, and κ equal to 1 and −1 meaning perfect agreement and perfect disagreement respectively.

Table 3.3: Computing the Kappa statistic from the confusion matrix

(a) Confusion matrix: numbers

               Predicted class
               up     down   flat   Total
Actual up      139    80     89     308
Actual down    10     300    13     323
Actual flat    40     16     313    369
Total          189    396    415    1000

(b) Confusion matrix: probabilities

               Predicted class
               up     down   flat   Total
Actual up      0.14   0.08   0.09   0.31
Actual down    0.01   0.30   0.01   0.32
Actual flat    0.04   0.02   0.31   0.37
Total          0.19   0.40   0.42   1.00

For example, in Table 3.3a the results of a three-class problem are shown, with the marginal probabilities calculated in Table 3.3b. The total agreement probability, also known as accuracy, is computed as P0 = 0.14 + 0.30 + 0.31 = 0.75, while the probability by chance is Pc = (0.19×0.31) + (0.40×0.32) + (0.42×0.37) = 0.34. The kappa statistic is therefore κ = (0.75 − 0.34)/(1 − 0.34) = 0.62.

3.4.4 ROC

The third approach to comparing classifiers is the Receiver Operating Characteristic (ROC) curve. This is a plot of the true positive rate, which is also called recall or sensitivity (Equation 3.10), against the false positive rate, which is also known as 1 − specificity (Equation 3.11).

\[ TPR = \frac{TP}{TP + FN} \tag{3.10} \]

\[ FPR = \frac{FP}{TN + FP} \tag{3.11} \]

The best performance is indicated by a curve close to the top left corner (i.e. a small false positive rate and a large true positive rate), with a curve along the diagonal reflecting a purely random classifier. As a demonstration, in Figure 3.5 three ROC curves are displayed for three classifiers.

Figure 3.5: ROC curve example
Classifier 1 has a more favorable ROC curve than Classifier 2 or 3. Classifier 2 is slightly better than random, while Classifier 3 is worse than random. In Classifier 3’s case, it would be better to choose the opposite of what the classifier predicts.

For single-number comparison, the Area Under the ROC Curve (AUC) is calculated by integrating the ROC curve. A random classifier would therefore have an AUC of 0.50, and classifiers better and worse than random would have an AUC greater than and less than 0.50 respectively. AUC is most commonly used with two-class problems, although for multi-class problems the AUC can be weighted according to the class distribution. AUC is also equal to the Wilcoxon statistic.

3.4.5 Cost-based

The cost-based method of evaluating classifiers is based on the “cost” associated with making incorrect decisions [61, 65, 102]. The performance metrics seen thus far do not take into consideration the possibility that not all classification errors are equal. For example, an opportunity cost can be associated with missing a large move in a stock. A cost can also be assigned to initiating an incorrect trade. A model can be built with a high recall, which misses no large moves in the stock, but the precision would most likely suffer. The cost-based approach assigns a cost to this decision, which can be evaluated to determine the suitability of the model. A cost matrix is used to represent the cost of each decision, with the goal of minimizing the total cost associated with the model. This can be formalized with a cost matrix C whose entry (i, j) is the cost of predicting class j when the actual class is i. When i = j the prediction is correct and when i ≠ j the prediction is incorrect.

An advantage of using a cost-based evaluation metric for trading models is that the cost associated with making incorrect decisions is known by analyzing empirical data.
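
The metrics of Sections 3.4.2 to 3.4.4 can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used for the experiments: it recomputes per-class precision, recall and F-measure and Cohen's kappa from the counts of Table 3.3a, and computes AUC through its equivalence to the Wilcoxon statistic. The function names and the small score lists in the usage part are assumptions made for illustration only.

```python
# Counts from Table 3.3a: rows are the actual class, columns the predicted class.
# The down/down cell is taken as 300, the value implied by the table's row
# total (323) and column total (396).
LABELS = ["up", "down", "flat"]
CONFUSION = [
    [139, 80, 89],   # actual up
    [10, 300, 13],   # actual down
    [40, 16, 313],   # actual flat
]


def precision_recall_f(matrix, k):
    """Treat class k as 'positive' and return (precision, recall, F-measure)."""
    tp = matrix[k][k]
    fp = sum(row[k] for row in matrix) - tp   # predicted k, actually another class
    fn = sum(matrix[k]) - tp                  # actually k, predicted another class
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def cohen_kappa(matrix):
    """Equations 3.7 to 3.9: (P0 - Pc) / (1 - Pc) from a square confusion matrix."""
    n = sum(sum(row) for row in matrix)
    size = len(matrix)
    p0 = sum(matrix[i][i] for i in range(size)) / n
    pc = sum(
        (sum(matrix[i]) / n) * (sum(row[i] for row in matrix) / n)
        for i in range(size)
    )
    return (p0 - pc) / (1 - pc)


def auc_wilcoxon(positive_scores, negative_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as one half, which makes this the Wilcoxon (Mann-Whitney) form.
    """
    wins = 0.0
    for p in positive_scores:
        for q in negative_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


if __name__ == "__main__":
    for k, label in enumerate(LABELS):
        p, r, f = precision_recall_f(CONFUSION, k)
        print(f"{label}: precision={p:.2f} recall={r:.2f} F={f:.2f}")
    print(f"kappa={cohen_kappa(CONFUSION):.2f}")
    # Hypothetical classifier scores for positive and negative examples.
    print(f"AUC={auc_wilcoxon([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]):.2f}")
```

With the counts above, the kappa value rounds to 0.62, in agreement with the worked example. The pairwise AUC loop is quadratic in the number of examples; it is written this way only to make the link to the Wilcoxon statistic explicit.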

For example, all trades incur a cost in the form of a trade commission, and money used in a trade is temporarily unavailable, thus incurring an opportunity cost. Additionally, a loss associated with an incorrect decision can be averaged over similar previous losses; gains can be computed similarly.

Consider, for example, a trading firm attempting to predict the directional price move of a stock with the objective of trading on the decision. At time t, the stock can move up, move down, or have no change in price; at time t + n, the direction is unknown (this can be observed in Figure 3.6). For time t + 1, a prediction of up might result in the firm purchasing the stock. Different errors in classification, however, would have different associated costs. A firm expecting a move up would purchase the stock in anticipation of the move, but a subsequent move down would be more harmful than no change in price. An actual move down would immediately result in a trading loss, whereas no change in price would result in a temporary opportunity cost, with the stock still having the potential to go in the desired direction. Additionally, an incorrect prediction of “no change” would merely result in a lost opportunity, with no actual money put at risk, since a firm would not trade in anticipation of an unchanged market. Table 3.4 represents a theoretical cost matrix of the problem, with three separate error amounts represented: 0.25, 0.50, and 1.25.

Table 3.4: Theoretical cost matrix

                            Predicted class
                            Down     No change   Up
Actual class   Down         0        0.25        1.25
               No change    0.50     0           0.50
               Up           1.25     0.25        0

3.4.6 Profitability of the model

While the end result of predicting stock price direction is to increase profitability, the performance metrics discussed thus far (with the exception of the cost-based metric) evaluate classifiers based on the ability to classify correctly and not on the overall profitability of a trading system. As an example, a classifier may have very high accuracy, kappa, AUC, etc., but this may not necessarily equate to a profitable trading strategy, since the profitability of individual trades may be more important than being “right” a majority of times; e.g. making $0.50 on each of one hundred trades is not as profitable as losing $0.05 95 times and then making $12 on each of five trades. (An argument can also be made that a less volatile approach, i.e. making small sums consistently, is preferable. This depends on the overall objective of the trader: maximizing stability or overall profitability.)

Figure 3.7: Trading algorithm process

Figure 3.7 shows a trading model found in much of the academic literature, where the classifier is built on the data with a prediction of up, down, or no change in the market price, and the outcome is passed to a second set of rules. These rules determine whether a prediction of “up”, for example, should equate to buying stock, buying more stock, or buying back a position that was previously shorted. The rules also address the amount of stock to be purchased, how much to risk, etc.

When considering profitability of a model, the literature generally follows the form of an automated trading model, which is “buy when the model says to buy, then sell after n number of minutes/hours/days” [161] or “buy when the model says to buy, then sell if the position is up x% or else sell after n minutes/days/hours” [138, 164, 202]. Teixeira et al. [214] added another rule (called a “stop loss” within trading), which prevented losses from going past a certain dollar amount during an individual trade.

The goal of this thesis is not to provide an “out of the box” trading system with proven profitability, but instead to help the user make trading decisions with the help of machine learning techniques.
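
The two ideas in this subsection, that hit rate and total profit can disagree and that the literature's trading rules are simple exit conditions, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation of any cited study: the function and parameter names are hypothetical, the trade is a long position in a single share, and the stop loss is checked before the profit target within each step.

```python
def summarize(trade_pnls):
    """Return (hit_rate, total_profit) for a list of per-trade profits in dollars."""
    wins = sum(1 for pnl in trade_pnls if pnl > 0)
    return wins / len(trade_pnls), sum(trade_pnls)


# The two hypothetical traders from the text.
steady = [0.50] * 100                 # right every time, small gains
lumpy = [-0.05] * 95 + [12.00] * 5    # wrong 95% of the time, few large gains


def exit_trade(entry_price, later_prices, take_profit_pct, stop_loss_amount, max_holding):
    """Apply 'sell if up x%, else sell after n periods' with a dollar stop loss.

    later_prices holds the prices observed after entry, one per period.
    Returns (index, price, reason) for the period in which the position is closed.
    """
    if not later_prices or max_holding < 1:
        raise ValueError("need at least one price and a holding period of at least 1")
    window = later_prices[:max_holding]
    for i, price in enumerate(window):
        change = price - entry_price
        if change <= -stop_loss_amount:                 # loss limit reached
            return i, price, "stop_loss"
        if change / entry_price >= take_profit_pct:     # profit target reached
            return i, price, "take_profit"
    # Neither threshold was hit: close at the end of the holding period.
    return len(window) - 1, window[-1], "time_exit"


if __name__ == "__main__":
    for name, pnls in (("steady", steady), ("lumpy", lumpy)):
        hit_rate, total = summarize(pnls)
        print(f"{name}: hit rate {hit_rate:.0%}, total profit ${total:.2f}")
    print(exit_trade(100.0, [100.5, 101.2, 102.5], 0.02, 1.0, 5))
```

The first part reproduces the arithmetic of the example in the text: the trader who is right on every trade ends with less money than the trader who is wrong 95% of the time. The exit function covers only the closing of a position; position sizing and the other rules mentioned above are deliberately left out.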
Additionally, there are many different rules in the trading literature relating to how much stock to buy or sell, how much money to risk in a position, how often trades should take place, and when to buy and sell; each of these questions is enough for an entire dissertation. In practice, trading systems often involve many layers of controls, such as forecasting and optimization methodologies that are filtered through multiple layers of risk management. This typically involves a human supervisor (risk manager) who can make decisions such as when to override the system [69]. The focus of this paper, therefore, will remain on the classifier itself: maximizing predictability when faced with different market conditions.

Table 3.5: Importance of using an unbiased estimate of generalizability (trained using the dataset from Appendix B for January 3, 2012)

             January 3, 2012 (training data)    January 4, 2012 (unseen data)
Accuracy     94.713%                            37.31%
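
Returning to the cost-based metric of Section 3.4.5, the total cost of a set of predictions can be sketched with the values of Table 3.4. This is a minimal illustration: the dictionary layout, the function names, and the two small prediction lists are assumptions made for the example and do not come from the experiments.

```python
# COST[actual][predicted], using the values of Table 3.4.
COST = {
    "down":      {"down": 0.00, "no change": 0.25, "up": 1.25},
    "no change": {"down": 0.50, "no change": 0.00, "up": 0.50},
    "up":        {"down": 1.25, "no change": 0.25, "up": 0.00},
}


def total_cost(actual, predicted, cost=COST):
    """Sum the cost-matrix entry C(i, j) over all (actual i, predicted j) pairs."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(cost[a][p] for a, p in zip(actual, predicted))


def accuracy(actual, predicted):
    """Fraction of predictions that match the actual class."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical outcomes and two hypothetical models that each make one mistake.
actual = ["up", "up", "down", "no change"]
model_a = ["up", "no change", "down", "no change"]   # missed an up move
model_b = ["up", "down", "down", "no change"]        # predicted down on an up move

if __name__ == "__main__":
    for name, predicted in (("A", model_a), ("B", model_b)):
        print(name, accuracy(actual, predicted), total_cost(actual, predicted))
```

Both hypothetical models have the same accuracy, but the one that predicts the opposite direction incurs five times the cost of the one that merely misses the move, which is the distinction the cost matrix is meant to capture.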


Multiple regression
Model validation
ROC curve
Predictive model
Loss function
L1/L2 Regularization
Response variable
Jackknife resampling/Jackknifing
MSE – mean squared error
Selection bias
Local Max/Min/Optimum
A/B Testing
Web Analytics
Root cause analysis
Big data
Data mining
Binary hypothesis test
Null hypothesis (H0)
Alternative hypothesis (H1)
Statistical Power
Type I error
Type II error
Ridge regression
K-means clustering
Semantic Indexing
Principal Component Analysis
Supervised learning
Unsupervised learning
False positives
False negatives
Feature vector
Random forest
Support Vector Machines
Collaborative Filtering
Cosine distance
Naive Bayes
Boosted trees
Decision tree
Stepwise regression
Impurity measures
Maximal margin classifier
Kernel trick
Dimensionality reduction
Curse of dimensionality
Newton’s method



Data Science Definition. What is Data Science?

There is much debate on it, but the short definition of data science is:

“Data science is an interdisciplinary field of using scientific methods to get information from data in various forms.”

Data science involves using methods from the fields of statistics, computer science, and mathematics to interpret data for business decisions. The amount of data available in modern society grows as technology plays a larger role in people's lives. These massive sets of structured and unstructured data help to reveal patterns and trends for business opportunities or academic research. A growing number of traditional fields of science now carry the adjective ‘computational’ or ‘quantitative’. In industry, data science is transforming everything from healthcare to media, and this trend is likely to pick up in the future.

Machine Learning Definition

Machine learning is a subfield of computer science that provides computers with the ability to learn without being explicitly programmed. The goal of machine learning is to develop learning algorithms that learn automatically, without human intervention or assistance, just by being exposed to new data. The machine learning paradigm can be viewed as “programming by example”. This subarea of artificial intelligence intersects broadly with other fields such as statistics, mathematics, physics, theoretical computer science, and more.