Welcome Back! Featuring Plotting & Supercomputers
from pathlib import Path
DATA_DIR = Path("/kaggle/input")
if (DATA_DIR / "ucfai-core-fa18-welcome-back").exists():
DATA_DIR /= "ucfai-core-fa18-welcome-back"
elif DATA_DIR.exists():
# no-op to keep the proper data path for Kaggle
pass
else:
# You'll need to download the data from Kaggle and place it in the `data/`
# directory beside this notebook.
# The data should be here: https://kaggle.com/c/ucfai-core-fa18-welcome-back/data
DATA_DIR = Path("data")
Welcome to SIGAI
What is it?
We’re the AI club on campus. Focusing on both industry skills and getting involved in research, we’re here to foster the next generation of AI researchers and engineers at UCF.
Who are we?
- John Muchovej
- Chas Kane
- Evan Waldmann
- Aidan Lakshman
- Richard DiBacco
Our goals:
- Break away the ivory tower imposed around AI
- Provide lecture-workshops in which you’ll leave with skills you can immediately use
- Cultivate an interest in many facets of artificial intelligence and machine learning
- Form, and foster, a competitive data science team which participates in competitions and also works with local businesses and governments.
The Master Plan (for Fall 2018)
Unit 0: Basics
- Welcome Back! Featuring Plotting & Supercomputers
- Intro to Data Analysis with Pandas & NumPy
- Let Data Speak Using Regression & Plots
Unit 1: Neural Networks
- Neural Networks & Inuition, Using PyTorch
- De-convolving Neural Networks
- Programming Dies, as Machines Learn to Code (Recurrent Neural Networks)
- Generative Adversarial Networks // Variational AutoEncoders (TBD)
Unit 2: Time-Traveling
- Heuristics
- Decision Trees
- Support Vector Machines
Unit 3: Applied
- 2-4 workshops where we, as a group, tear apart datasets, applying what we’ve learned and share insights/conclusions drawn
Accessing Newton
Newton is the name of UCF’s GPU cluster. It’s where we’ll conduct all our lectures, and you’ll also have access to the cluster throughout the semester; although we strongly request that you contact John prior to running anything on your own through your sigai.student
account.
Getting Your Keys
This section is to be spoken, only. Not present anywhere except the lecture room.
With the Keys: MacOS / Linux First
We’ll be assuming you placed your keys in ~/Downloads
(your user’s Downloads directory).
mkdir ~/.ssh/
unzip Downloads/sigai.student<N>.zip
cp Downloads/sigai.student<N>/sig* ~/.ssh/
cat Downloads/sigai.student<N>/config >> ~/.ssh/config
cat ~/.ssh/sigai.student<N>_pass.txt
Copy this into your clipboard, you’ll need it in the next step!- Open a Terminal, and try this:
ssh sigai.newton
- You’ll be prompted for the passphrase you copied to your clipboard earlier. Paste it and hit “Enter”
If all went well, you’ll be greeted with this:
+===================================================+
| Welcome to the Newton HPC at the UCF Advanced |
| Research Computing Center. |
|---------------------------------------------------|
| Problems? Email: <redated> |
+===================================================+
[sigai.student<N>@<redacted> ~]$
With the Keys: Windows
We’ll be assuming you placed your keys in ~/Downloads
(your user’s Downloads directory).
- Download PuTTy – use ucfsigai.org/tools/putty, also downlo
- Open
File Explorer
and unzipsigai.student<N>.zip
- Double-click
sigai.student<N>.ppk
– you’ll be prompted for a passphrase, this is insigai.student<N>_pass.txt
, copy and paste into the prompt.
Now that we’re all logged in…
git clone https://github.com/ucfai/core
cd core/fa18
./launch.sh
- Follow the instructions printed out by
./launch.sh
and openlocalhost:19972
on your browser. - Navigate to
08-29-welcome-back/welcome-back.ipynb
and open it – you should find yourself in this very notebook!
Get a Taste of Python and Plotting
In order to understand what your data is and how dirty is it (because all data is dirty), you need to understand what that data looks like. Through plotting you can easily see trends and outliers that will help you determine if your data is what you think it is.
First things first, you have to download your find what package your comfortable with there are loads, so feel free to look around. For the purposes of this meeting we are going to be using plotnine because it is basically the same as ggplot2 from R which is a very powerful library that is widely used in many different fields (there is a ggplot package as well, but that was giving me some issues).
If you take a look at fa18.env.yml
, you’ll find the Anaconda Environment (very similar to pip virtualenv
) we’ll be using for the semester – you don’t need to worry about this for the semester, though.
Then you have to import the libraries and read in your data.
import pandas as pd
import numpy as np
from pandas.api.types import CategoricalDtype
from plotnine import *
df = pd.read_csv(DATA_DIR / 'iris/Iris.csv')
df.head()
So now we have our data. We also want our data in this format, which is normal referred to as a data frame. Each row is an entry of data and each column is a type of data that was collected. With this, we can now start graphing.
ggplot(aes(x='Species', y='SepalWidthCm', color='Species'), data=df) + \
geom_boxplot(aes(color='Species')) + \
xlab("Species") + \
ylab("Width") + \
ggtitle("Comparing Sepal length and width across different species")
You can that we can get some nice looking graphs very easily. But what if we wanted to customized them more?
(ggplot(df, aes('Species','SepalLengthCm')) +
geom_boxplot(aes(color=('Species'))) +
labs(title='graph title', x='x title', color='legend title', y='y title') +
scale_x_discrete(labels=['setosa', 'versicolor', 'virginica']) +
scale_color_manual(labels=['setosa', 'versicolor', 'virginica'], values=['#000000', '#9ebcda', '#8856a7'])
+ theme_bw())
The customization is also pretty easy, but one of the things that most non statistician love about R is that they can add a fitted linear model line with about 25 characters.
(ggplot(df, aes(x='SepalWidthCm', y='SepalLengthCm', color=('Species'))) +
geom_point() + stat_smooth(method='lm') +
labs(title="graph title", x="x title", color="legend title", y="y title") +
scale_color_manual(labels=['one', 'two', 'three'], values=['#000000', '#9ebcda', '#8856a7']) +
theme_classic())
Sometimes, seperating the data by category can help you understand the trends better. This is called faceting. Note that since ggplot started in R changing the names of individual graphs is somewhat hard; an easy work around for this is to change the string factors in the data since the names depend on the species names from data.
(ggplot(df, aes(x='SepalWidthCm', y='SepalLengthCm', color=('Species'))) +
geom_point() + stat_smooth(method='lm') + facet_wrap('~Species') +
labs(title="graph title", x="x title", color="legend title", y="y title") +
scale_color_manual(values=['#000000', '#9ebcda', '#8856a7']) +
theme_bw())
Now lets try a different dataset.
pkdata = pd.read_csv(DATA_DIR / 'pokemon/Pokemon.csv')
pkdata.rename(lambda x: str(x).replace(" ", ""), axis="columns", inplace=True)
pkdata.head()
(ggplot(pkdata, aes('Legendary', 'Attack')) + geom_boxplot() +
labs(title='Legendary Pokemon have a higher distribution of\n attack values than non-Legendary Pokemon') +
theme_xkcd())
(ggplot(pkdata, aes('Attack', 'Defense', color="Generation")) +
geom_point() +
theme_xkcd())
(ggplot(pkdata, aes('Attack', 'Defense', color="Generation")) +
geom_point() + facet_wrap('~Type1') +
theme_xkcd())
(ggplot(pkdata, aes('Legendary', 'Total')) + geom_boxplot() +
labs(title='Sum of Total Stats Separated by\n Type and Legendary Status') +
facet_wrap('~Type1') +
theme_xkcd())
Now what if we wanted to save this graph?
p = (ggplot(pkdata, aes('Legendary', 'Total')) + geom_boxplot() +
labs(title='Sum of Total Stats Separated by\n Type and Legendary Status') +
facet_wrap('~Type1') +
theme_xkcd())
p.save(filename='pokemanPlot.png', height=5, width=5, units='in', dpi=150)
Now you have the basics of graphing. Keep in mind that this library translates almost exactly into code for ggplot2
in R
. If you want to refine your skills, the best thing you can do is to find a graph that you think is cool and try to make it exactly. Doing this, you will quickly learn the quarks of ggplot
.
You can check out these links for more information on plotnine:
- https://www.kaggle.com/residentmario/grammar-of-graphics-with-plotnine-optional?scriptVersionId=4327772
- http://pltn.ca/plotnine-superior-python-ggplot/
And you can check out these for some more help with graphing using ggplot
: