A Random Walk through a Random Forest


A Random Walk Down Wall Street, written by Burton Gordon Malkiel, a Princeton economist, is a book on the subject of stock markets which popularized the random walk hypothesis. Malkiel argues that asset prices typically exhibit signs of random walk and that one cannot consistently outperform market averages. The book is frequently cited by those in favor of the efficient-market hypothesis. As of 2015, there have been eleven editions and over 1.5 million copies sold.[1] A practical popularization is The Random Walk Guide to Investing: Ten Rules for Financial Success.[2]


A random walk is a mathematical object, known as a stochastic or random process, that describes a path consisting of a succession of random steps on some mathematical space, such as the integers. An elementary example is the random walk on the integer number line ℤ, which starts at 0 and at each step moves +1 or −1 with equal probability. Other examples include the path traced by a molecule as it travels in a liquid or gas, the search path of a foraging animal, the price of a fluctuating stock, and the financial status of a gambler: all can be approximated by random walk models, even though they may not be truly random in reality. As these examples illustrate, random walks have applications in many scientific fields, including ecology, psychology, computer science, physics, chemistry, and biology, as well as economics. Random walks explain the observed behavior of many processes in these fields, and thus serve as a fundamental model for the recorded stochastic activity. As a more mathematical application, the value of π can be approximated by using a random walk in an agent-based modelling environment.[b1][b2] The term random walk was first introduced by Karl Pearson in 1905.[b3]
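The elementary walk on ℤ described above can be simulated in a few lines of Python. This is a minimal sketch; the function name and seed are illustrative:

```python
import random

def simple_random_walk(n_steps, seed=0):
    """Simulate the elementary +1/-1 random walk on the integers, starting at 0."""
    rng = random.Random(seed)
    path = [0]
    for _ in range(n_steps):
        # Each step moves +1 or -1 with equal probability.
        path.append(path[-1] + rng.choice((1, -1)))
    return path

path = simple_random_walk(1000)
```

The resulting path starts at 0 and changes by exactly one at every step, matching the definition above.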
Various types of random walks are of interest, and they can differ in several ways. The term itself most often refers to a special category of Markov chains or Markov processes, but many time-dependent processes are referred to as random walks, with a modifier indicating their specific properties. Random walks (Markov or not) can also take place on a variety of spaces: commonly studied ones include random walks on graphs, on the integers or the real line, in the plane or in higher-dimensional vector spaces, on curved surfaces or higher-dimensional Riemannian manifolds, and on finite, finitely generated, or Lie groups. The time parameter can also be varied. In the simplest setting the walk is in discrete time, that is, a sequence of random variables (Xt) = (X1, X2, ...) indexed by the natural numbers. However, it is also possible to define random walks which take their steps at random times, and in that case the position Xt has to be defined for all times t ∈ [0,+∞). Specific cases or limits of random walks include the Lévy flight and diffusion models such as Brownian motion.
Random walks are a fundamental topic in discussions of Markov processes. Their mathematical study has been extensive. Several properties, including dispersal distributions, first-passage or hitting times, encounter rates, recurrence or transience, have been introduced to quantify their behavior.



This article will examine the use of well-known technical-analysis indicators as a feature set for a Random Forest.


The Money Flow Index (MFI) is used to measure the "enthusiasm" of the market. In other words, the money flow index shows how heavily a stock was traded.
A value of 80 or more is generally considered overbought, a value of 20 or less oversold. Divergences between MFI and price action are also considered significant, for instance if price makes a new rally high but the MFI high is less than its previous high then that may indicate a weak advance that is likely to reverse.
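As a rough illustration of the definition (not TA-Lib's implementation, whose details may differ), the MFI can be computed from high, low, close, and volume series as the percentage of money flow over the lookback period that was positive:

```python
def money_flow_index(high, low, close, volume, period=14):
    """Sketch of the Money Flow Index from plain Python lists."""
    # Typical price and raw money flow per bar.
    typical = [(h + l + c) / 3.0 for h, l, c in zip(high, low, close)]
    raw_flow = [tp * v for tp, v in zip(typical, volume)]
    pos, neg = [], []
    for t in range(1, len(typical)):
        if typical[t] > typical[t - 1]:      # up bar: positive money flow
            pos.append(raw_flow[t]); neg.append(0.0)
        elif typical[t] < typical[t - 1]:    # down bar: negative money flow
            pos.append(0.0); neg.append(raw_flow[t])
        else:                                # unchanged: neither
            pos.append(0.0); neg.append(0.0)
    mfi = []
    for t in range(period - 1, len(pos)):
        p = sum(pos[t - period + 1:t + 1])
        n = sum(neg[t - period + 1:t + 1])
        total = p + n
        # 50.0 for a completely flat window is an assumption (undefined case).
        mfi.append(50.0 if total == 0 else 100.0 * p / total)
    return mfi
```

A steadily rising series produces MFI = 100, the overbought extreme, which matches the intuition above.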

The StochRSI is an indicator used in technical analysis that ranges between zero and one and is created by applying the Stochastic Oscillator formula to a set of Relative Strength Index (RSI) values rather than standard price data. Using RSI values within the Stochastic formula gives traders an idea of whether the current RSI value is overbought or oversold, a measure that becomes especially useful when the RSI value is confined between its signal levels of 20 and 80.
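A minimal sketch of that construction, assuming the RSI series has already been computed (plain Python lists and a simple rolling window; the function name is illustrative):

```python
def stoch_rsi(rsi_values, period=14):
    """Apply the Stochastic Oscillator formula to RSI values.

    Returns values in [0, 1]: the current RSI's position within the
    min-max range of RSI over the lookback window.
    """
    out = []
    for t in range(period - 1, len(rsi_values)):
        window = rsi_values[t - period + 1:t + 1]
        lo, hi = min(window), max(window)
        # Guard against a flat window (hi == lo) to avoid division by zero.
        out.append(0.0 if hi == lo else (rsi_values[t] - lo) / (hi - lo))
    return out
```

For example, if RSI has risen monotonically over the window, the current RSI is at the top of its range and StochRSI reads 1.0.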

Kaufman's Adaptive Moving Average (KAMA) is a moving average designed to account for market noise or volatility. KAMA will closely follow prices when the price swings are relatively small and the noise is low. KAMA will adjust when the price swings widen and follow prices from a greater distance. This trend-following indicator can be used to identify the overall trend, time turning points and filter price movements.
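The adaptive behavior comes from an efficiency ratio that scales the smoothing constant between a fast and a slow EMA constant. A sketch under the common 10/2/30 parameterization (an assumption here; TA-Lib's KAMA may differ in seeding details):

```python
def kama(prices, er_period=10, fast=2, slow=30):
    """Sketch of Kaufman's Adaptive Moving Average over a list of prices."""
    fast_sc = 2.0 / (fast + 1)   # smoothing constant of a fast EMA
    slow_sc = 2.0 / (slow + 1)   # smoothing constant of a slow EMA
    out = [prices[er_period - 1]]  # seed with the first available price (an assumption)
    for t in range(er_period, len(prices)):
        # Efficiency ratio: net change divided by total path length.
        change = abs(prices[t] - prices[t - er_period])
        volatility = sum(abs(prices[i] - prices[i - 1])
                         for i in range(t - er_period + 1, t + 1))
        er = 0.0 if volatility == 0 else change / volatility
        # Scaled smoothing constant: near fast_sc in trends, near slow_sc in noise.
        sc = (er * (fast_sc - slow_sc) + slow_sc) ** 2
        out.append(out[-1] + sc * (prices[t] - out[-1]))
    return out
```

When prices trend cleanly, er approaches 1 and KAMA hugs the price; in choppy noise, er approaches 0 and KAMA barely moves, which is the filtering behavior described above.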

# Assumes pandas and the TA-Lib abstract API (talib.abstract), where each
# indicator accepts an OHLCV DataFrame.
import pandas as pd
from talib.abstract import (SMA, MFI, MOM, WMA, ULTOSC, STOCHF, MACD,
                            RSI, WILLR, CCI, ADOSC, ROCP)

def get_indicators(stocks, period):
    stocks_indicators = {}
    for i in stocks:
        features = pd.DataFrame(SMA(stocks[i], timeperiod=10))
        features.columns = ['sma_10']
        features['mfi_10'] = pd.DataFrame(MFI(stocks[i], timeperiod=10))
        features['mom_10'] = pd.DataFrame(MOM(stocks[i], 10))
        features['wma_10'] = pd.DataFrame(WMA(stocks[i], 10))
        features['ultosc_4'] = pd.DataFrame(ULTOSC(stocks[i], timeperiod1=4,
                                                   timeperiod2=7, timeperiod3=14))
        features = pd.concat([features,
                              STOCHF(stocks[i], fastk_period=14, fastd_period=3)],
                             axis=1)
        features['macd'] = pd.DataFrame(MACD(stocks[i], fastperiod=5,
                                             slowperiod=14)['macd'])
        features['rsi'] = pd.DataFrame(RSI(stocks[i], timeperiod=14))
        features['willr'] = pd.DataFrame(WILLR(stocks[i], timeperiod=14))
        features['cci'] = pd.DataFrame(CCI(stocks[i], timeperiod=14))
        features['adosc'] = pd.DataFrame(ADOSC(stocks[i], fastperiod=3, slowperiod=10))
        # Percent change over `period` bars, shifted back so each row holds
        # its *forward* return.
        features['raw_pct_change'] = ROCP(stocks[i], timeperiod=period)
        features['raw_pct_change'] = features['raw_pct_change'].shift(-period)
        # Label '1' if the forward return exceeds 3%, else '0'.
        features['pct_change'] = features['raw_pct_change'].apply(
            lambda x: '1' if x > 0.03 else '0')
        features = features.dropna()
        stocks_indicators[i] = features
    return stocks_indicators
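The labeling step at the end of get_indicators is worth isolating: ROCP over `period` bars followed by shift(-period) gives each row its forward return. A pandas-only sketch of that logic (the function name and sample prices are illustrative):

```python
import pandas as pd

def forward_return_labels(close, period=1, threshold=0.03):
    """Label each row '1' if the return from that row to `period` rows
    ahead exceeds `threshold`, else '0' (mirroring get_indicators)."""
    close = pd.Series(close, dtype=float)
    # pct_change(period) is the backward return; shifting by -period turns
    # it into the forward return seen from each row.
    fwd = close.pct_change(period).shift(-period)
    labels = fwd.apply(lambda x: '1' if x > threshold else '0')
    return fwd, labels

fwd, labels = forward_return_labels([100.0, 100.0, 103.5, 100.0], period=1)
```

Only the second row gains more than 3% over the next bar, so only it is labeled '1'; the trailing row, whose forward return is undefined (NaN), falls through to '0' and is later dropped by dropna() in the full pipeline.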

>>> pred = clf.predict(test[features])
>>> list(zip(train[features], clf.feature_importances_))
[('sma_10', 0.08805410862396797), ('mfi_10', 0.09451084071039871), ('mom_10', 0.069575191927391), ('wma_10', 0.13220290244507055), ('ultosc_4', 0.06845575657410827), ('fastk', 0.05566000353802879), ('fastd', 0.09610385618777093), ('macd', 0.07816429691621057), ('rsi', 0.06785865605512234), ('willr', 0.07602636981442927), ('cci', 0.06132717952251619), ('adosc', 0.1120608376849855)]
>>> pd.crosstab(test['pct_change'],pred)
col_0         0   1
pct_change
0           142   7
1            30  11
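Reading the crosstab as a confusion matrix (rows are actual labels, columns are predicted), the headline metrics can be computed from the printed counts:

```python
# Counts read off the crosstab above: rows = actual, columns = predicted.
tn, fp = 142, 7   # actual 0: correctly rejected vs. false alarms
fn, tp = 30, 11   # actual 1: missed gainers vs. correctly caught gainers

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of predicted gainers, the share that gained
recall = tp / (tp + fn)     # of actual gainers, the share that was caught
```

Accuracy looks respectable (~0.81), but recall on the '1' class is low (~0.27): most >3% moves are missed, a common outcome when the positive class is rare.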


Crowd Programming Engine (CPE)

"A Distributed Agent-Based system with PyClips, RandomForests and B-D-I connected over a shared Tuple-Space powered by google spreadsheets or something better when we find it." - Charles Kosta, Sc.D.

Random Forest of Irises



cpe.net:~/workspace $ python
Python 2.7.6 (default, Oct 26 2016, 20:30:19)
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from sklearn.datasets import load_iris

>>>
>>> from sklearn.ensemble import RandomForestClassifier
>>> import pandas as pd
>>> import numpy as np
>>> np.random.seed(0)
>>> iris = load_iris()
>>> df = pd.DataFrame(iris.data, columns=iris.feature_names)
>>> df.head()
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2
>>> df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
>>> df.head()
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm) species
0                5.1               3.5                1.4               0.2  setosa
1                4.9               3.0                1.4               0.2  setosa
2                4.7               3.2                1.3               0.2  setosa
3                4.6               3.1                1.5               0.2  setosa
4                5.0               3.6                1.4               0.2  setosa
>>> df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75
>>> df.head()
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm) species  is_train
0                5.1               3.5                1.4               0.2  setosa      True
1                4.9               3.0                1.4               0.2  setosa      True
2                4.7               3.2                1.3               0.2  setosa      True
3                4.6               3.1                1.5               0.2  setosa      True
4                5.0               3.6                1.4               0.2  setosa      True
>>> train, test = df[df['is_train']==True], df[df['is_train']==False]
>>> print('Number of observations in the training data:', len(train))
('Number of observations in the training data:', 118)
>>> print('Number of observations in the test data:', len(test))
('Number of observations in the test data:', 32)
>>> features = df.columns[:4]
>>>
>>> features
Index([u'sepal length (cm)', u'sepal width (cm)', u'petal length (cm)',
u'petal width (cm)'],
dtype='object')
>>> y = pd.factorize(train['species'])[0]
>>> y
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2])
>>> clf = RandomForestClassifier(n_jobs=2, random_state=0)
>>> clf.fit(train[features], y)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=2,
oob_score=False, random_state=0, verbose=0, warm_start=False)
>>> clf.predict(test[features])
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 1, 1, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
>>> clf.predict_proba(test[features])[0:10]
array([[1. , 0. , 0. ],
[1. , 0. , 0. ],
[1. , 0. , 0. ],
[1. , 0. , 0. ],
[1. , 0. , 0. ],
[1. , 0. , 0. ],
[1. , 0. , 0. ],
[0.9, 0.1, 0. ],
[1. , 0. , 0. ],
[1. , 0. , 0. ]])
>>> preds = iris.target_names[clf.predict(test[features])]
>>> preds[0:5]
array(['setosa', 'setosa', 'setosa', 'setosa', 'setosa'], dtype='|S10')
>>> test['species'].head()
7 setosa
8 setosa
10 setosa
13 setosa
17 setosa
Name: species, dtype: category
Categories (3, object): [setosa, versicolor, virginica]
>>> pd.crosstab(test['species'], preds, rownames=['Actual Species'], colnames=['Predicted Species'])
Predicted Species  setosa  versicolor  virginica
Actual Species
setosa                 13           0          0
versicolor              0           5          2
virginica               0           0         12
>>> list(zip(train[features], clf.feature_importances_))
[('sepal length (cm)', 0.11185992930506346), ('sepal width (cm)', 0.016341813006098178), ('petal length (cm)', 0.36439533040889194), ('petal width (cm)', 0.5074029272799464)]
>>>
>>>

R and Python are joining forces

The most ambitious crossover event of the year—for programmers

Last month, Wes McKinney announced the founding of Ursa Labs, an innovation group intended to improve data-science tools. McKinney will partner with RStudio—Hadley Wickham’s employer, which maintains the most popular user interface for R—on the project. The main goals of Ursa Labs are to make it easier for data scientists working in different programming languages to collaborate, and to avoid redundant work by developers across languages. In addition to improving R and Python, the group hopes its work will also improve the user experience in other open-source programming languages like Java and Julia.



Getting Started with NetWorkSpaces

We are ultimately trying to use NetWorkSpaces (NWS) as a way for Python, R, and Java agents running on different machines to
communicate and coordinate with one another.

To get started with Python NWS, we'll first have to install the NWS server, and the Python NWS client.



pip install nwsserver (https://pypi.org/project/nwsserver/)

pip install Twisted (https://pypi.org/project/Twisted/)


Next, an NWS server must be started. This can be done from a shell
with twistd (Twisted's daemon runner), as follows:

% twistd -ny .local/nws.tac

From another window, start an interactive Python session, and import
the NetWorkSpace class:

% python
>>> from nws.client import NetWorkSpace

Next, create a work space called 'cpe':

>>> ws = NetWorkSpace('cpe')

This is using the NWS server on the local machine. Additional arguments
can be used to specify the hostname and port used by the NWS server if
necessary.

Once we have a work space, we can write data into a variable using the
store method:

>>> ws.store('joe', 17)

The variable 'joe' now has a value of 17 in our work space.

To read that variable we use the find method:

>>> age = ws.find('joe')

which sets 'age' to 17.

Note that the find method will block until the variable 'joe' has a
value. That is important when we're trying to read that variable from a
different machine. If it didn't block, you might have to repeatedly try
to read the variable until it succeeded.

There are times when you don't want to block, but just want to see if
some variable has a value. That is done with the findTry method.
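The blocking semantics of find/fetch versus the non-blocking findTry can be illustrated with a toy in-memory stand-in. This is not the real nws.client API, just a sketch of the behavior described above:

```python
import threading
from collections import defaultdict, deque

class ToyWorkSpace:
    """Tiny in-memory stand-in for an NWS work space (illustration only)."""

    def __init__(self):
        self._vars = defaultdict(deque)      # each variable holds a queue of values
        self._cond = threading.Condition()

    def store(self, name, value):
        with self._cond:
            self._vars[name].append(value)
            self._cond.notify_all()          # wake any blocked find/fetch

    def find(self, name):
        # Blocks until the variable has a value; reads without removing it.
        with self._cond:
            while not self._vars[name]:
                self._cond.wait()
            return self._vars[name][0]

    def fetch(self, name):
        # Blocks like find, but removes the value it returns.
        with self._cond:
            while not self._vars[name]:
                self._cond.wait()
            return self._vars[name].popleft()

    def find_try(self, name, missing=None):
        # Non-blocking: returns `missing` immediately if there is no value.
        with self._cond:
            return self._vars[name][0] if self._vars[name] else missing

ws = ToyWorkSpace()
ws.store('joe', 17)
```

After the store, find('joe') returns 17 and leaves the value in place; fetch('joe') returns 17 and removes it, so a subsequent find_try('joe') comes back empty instead of blocking.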

--------------------------------------------------------------------------------------

From another window, start an interactive Python session, and import
the NetWorkSpace class:

% python
>>> from nws.client import NetWorkSpace

Next, connect to a work space called 'cpe':

>>> ws = NetWorkSpace('cpe')

This won't create the 'cpe' work space, since it already exists. Now you
can execute an operation such as:

>>> x = ws.fetch('joe')
>>> x
17

You should see that you got the value 17 from the previous ws.store() call;
only this time you fetched it destructively, removing the value. Now try to execute the fetch again.

>>> x = ws.fetch('joe')

Your program will now block. In this session, watch it block for a minute, and then execute a store in
the other session:

>>> ws.store('joe', 18)

You should see the previous session return from the fetch command and x should now be 18.


That's a basic sample of this technology, which of course can become much more complex. Keep in mind
that these programs could have been running on different machines, or hosted on Amazon.

More importantly, an R program, a Python program,
and even a Java program could be working cooperatively.




NetWorkSpaces for Python

NetWorkSpaces: a framework to coordinate distributed programs


NetWorkSpaces (NWS) is one way to write parallel programs. It allows you to take advantage of multi-core machines, multiple virtual machines, and cloud-based clusters, using languages such as Python, R, Java, and Matlab. With NetWorkSpaces for Python, you can execute Python functions and programs in parallel using methods very much like the standard Python map function. In some cases, you may be able to parallelize your program in minutes, rather than months.

For example, here's a simple Python NWS script:

          from math import sqrt
          from nws.sleigh import Sleigh
          s = Sleigh()
          for x in s.imap(sqrt, xrange(10)):
              print x
          

It looks pretty simple, but you'll need to be familiar with the imap function in the standard itertools module.




PyClips (Available via PIP)


This module embeds a fully functional CLIPS engine into Python, and gives the developer a more Pythonic interface to CLIPS without cutting down on functionality. In fact, CLIPS is compiled into the module in its entirety, and most API functions are bound to Python methods. However, the direct bindings to the CLIPS library (implemented as the _clips submodule) are not described here: each function is described by an appropriate documentation string, accessible by means of the help() function or through the pydoc tool. Each direct binding maps to a function provided by the API. For a detailed reference to these functions, see the Clips Reference Guide Vol. II: Advanced Programming Guide, available for download at the CLIPS website.

PyCLIPS is also capable of generating CLIPS text and binary files: this allows the user to interact with sessions of the CLIPS system itself. An important thing to know, is that PyCLIPS implements CLIPS as a separated engine: in the CLIPS module implementation, CLIPS ``lives'' in its own memory space, allocates its own objects. The module only provides a way to send information and commands to this engine and to retrieve results from it.