How to Use Python for Your First Machine Learning Project (Detailed Tutorial)

Do you want to use Python for machine learning but find it difficult to get started?

In this tutorial, you will complete your first machine learning project in Python.

In the following tutorial, you will learn:

1. Download and install Python and SciPy, including the most useful packages for machine learning in Python.

2. Load the dataset and understand its structure using statistical summaries and data visualization.

3. Create 6 machine learning models and pick the best one based on its accuracy.

If you are a machine learning beginner and want to start using Python for your machine learning project, this tutorial is tailored for you.

Without further ado, let's get started.

How to start machine learning with Python?

The best way to learn machine learning is to design and complete small projects.

Difficulties encountered when getting started with Python

Python is a popular and powerful interpreted language. Unlike R, Python is a complete language and platform that can be used for research and development.

There are also many modules and libraries to choose from, offering a variety of ways to accomplish each task.

The best way to start using Python for machine learning is to complete a project.

It will prompt you to install and launch the Python interpreter.

It gives you a comprehensive view of how to develop a small project.

It will give you confidence and perhaps drive you to continue to do your own small projects.

Beginners need a small end-to-end project

Many books and courses are disappointing. They give you a lot of methods and snippets, but you never see how they fit together.

When you apply machine learning to your own dataset, you have already started a project.

The machine learning project may not be linear, but it has many typical steps:

Define the problem.

Prepare the data.

Evaluate algorithms.

Improve the results.

Present the results.

The best way to really come to terms with a new platform or tool is to work through a machine learning project end to end and cover the key steps: loading data, summarizing data, evaluating algorithms, and making some predictions.

If you can do this, you will have a template that you can use on dataset after dataset. Once you have more confidence, you can fill in the gaps, such as further data preparation, and improve the results.


Machine Learning Hello World

The best small project to start with on a new tool is the classification of iris flowers (the iris dataset: https://archive.ics.uci.edu/ml/datasets/Iris).

This is a well understood project.

Attributes are numeric, so you have to figure out how to load and process the data.

This is a classification problem that allows you to practice simpler supervised learning algorithms.

It is a multi-class classification problem (multinomial), which may require some specialized handling.

It has only 4 attributes and 150 rows, which means it is small and fits easily in memory (and on a screen or a single A4 page).

All of the numeric attributes are in the same unit and on the same scale, so you can get started without any special scaling or transformations.

Let's start using the hello world machine learning project in Python.

Machine Learning in Python: Step-by-Step Tutorial

In this section, we will work through an end-to-end small machine learning project.

Here's what we're going to cover:

Install the Python and SciPy platform

Load the dataset

Summarize the dataset

Visualize the dataset

Evaluate some algorithms

Make some predictions

Take your time and do it step by step.

You can type the commands in yourself or copy and paste them to speed things up.

1

Download, install and start Python SciPy

If they are not already installed on your system, install the Python and SciPy platform.

I don't want to cover this in too much detail, because others already have, and it is very straightforward for a developer.

1.1 Installing the SciPy Library

This tutorial assumes Python version 2.7 or 3.5.

You need to install 5 key libraries. The following is a list of the Python SciPy libraries required for this tutorial:

SciPy

Numpy

Matplotlib

Pandas

Sklearn

There are many ways to install these libraries, and my advice is to choose a method and then be consistent when installing each library.

The SciPy installation page (https://) provides excellent instructions for many different platforms, such as Linux, Mac OS X and Windows. If you have any questions or concerns, refer to those instructions.

On Mac OS X, you can use macports to install Python 2.7 and these libraries.

On Linux, you can use a package manager, such as yum on Fedora, to install RPM.

If you use Windows or you have no confidence, I recommend installing the free version of Anaconda (https://), which contains everything you need.
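For example, if you already have Python and pip available on your system, one possible route (just a suggestion; use whichever installation method suits your platform) is to install all five libraries with a single command:

pip install scipy numpy matplotlib pandas scikit-learn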

Note: This tutorial assumes that you have installed scikit-learn version 0.18 or higher.

1.2 Start Python and check the version

It is necessary to ensure that your Python environment is installed successfully and works as expected.

The script below will help you test your environment. It imports each library you need in this tutorial and prints out the version.

Open the command line and start the python interpreter:

python

I recommend working directly in the interpreter, or writing your scripts and running them on the command line rather than using big editors and IDEs. Keep things simple and focus on the machine learning, not the tool chain.

Type or copy and paste the following script:

# Check the versions of libraries

# Python version
import sys
print('Python: {}'.format(sys.version))
# scipy
import scipy
print('scipy: {}'.format(scipy.__version__))
# numpy
import numpy
print('numpy: {}'.format(numpy.__version__))
# matplotlib
import matplotlib
print('matplotlib: {}'.format(matplotlib.__version__))
# pandas
import pandas
print('pandas: {}'.format(pandas.__version__))
# scikit-learn
import sklearn
print('sklearn: {}'.format(sklearn.__version__))

This is the output I got on my OS X workstation:

Python: 2.7.11 (default, Mar 1 2016, 18:40:10)
[GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)]
scipy: 0.17.0
numpy: 1.10.4
matplotlib: 1.5.1
pandas: 0.17.1
sklearn: 0.18.1

Compare the above output to your version.

Ideally, your versions should match or be more recent. These APIs do not change quickly, so if your versions are a little higher, don't worry: everything in this tutorial will very likely still work for you.

If you get an error, stop. Now is the time to fix it.

If you are unable to run the above scripts properly, you will not be able to complete this tutorial.

My best advice is to search for your error message on Google.

2

Download Data

We will use the iris dataset. This data set is well known because it is used as a "hello world" in machine learning and statistics.

The dataset contains 150 observations of iris flowers. There are four columns of measurements of the flowers. The fifth column is the species of the flower observed. All observed flowers belong to one of three species.

In this step, we will load the iris data from a CSV file URL.

2.1 Import libraries

First, we'll import all the modules, functions, and objects we'll use in this tutorial.

# Load libraries
import pandas
# note: on newer pandas versions, scatter_matrix lives in pandas.plotting instead of pandas.tools.plotting
from pandas.tools.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

Everything should load without error. If you get an error, stop. You need a working SciPy environment before continuing. See the advice above for setting up your environment.

2.2 Loading the data set

We can load data directly from the UCI machine learning repository.

We are using pandas to load data. We will also use pandas to explore data with descriptive statistics and data visualization.

Note that we specified the name of each column when loading the data. This will help us study the data later.

# Load dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.read_csv(url, names=names)

The dataset should load without incident.

If you have network problems, you can download the iris data (https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data). Put the file in your working directory and load it in the same way, changing the URL to a local filename.
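As a small sketch, assuming you saved the downloaded file under the name iris.data in your working directory, the loading line would simply become:

# load from a local copy of the file instead of the URL
dataset = pandas.read_csv('iris.data', names=names)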

3

Summarize the dataset

It's time to look at the data.

In this step, we will look at the data in several different ways:

The dimensions of the dataset.

A peek at the data itself.

A statistical summary of all attributes.

A breakdown of the data by the class variable.

Don't worry, each look at the data is a single command. These are useful commands that you can use again and again on future projects.

3.1 Size of the data set

We can quickly see through the shape attribute how many instances (rows) and how many attributes (columns) are included in the data.

# shape
print(dataset.shape)

You should see 150 instances and 5 attributes:

(150, 5)

3.2 Peek at the data

Take a close look at your data:

# head
print(dataset.head(20))

You should see the first 20 rows of the data:

    sepal-length  sepal-width  petal-length  petal-width        class
0            5.1          3.5           1.4          0.2  Iris-setosa
1            4.9          3.0           1.4          0.2  Iris-setosa
2            4.7          3.2           1.3          0.2  Iris-setosa
3            4.6          3.1           1.5          0.2  Iris-setosa
4            5.0          3.6           1.4          0.2  Iris-setosa
5            5.4          3.9           1.7          0.4  Iris-setosa
6            4.6          3.4           1.4          0.3  Iris-setosa
7            5.0          3.4           1.5          0.2  Iris-setosa
8            4.4          2.9           1.4          0.2  Iris-setosa
9            4.9          3.1           1.5          0.1  Iris-setosa
10           5.4          3.7           1.5          0.2  Iris-setosa
11           4.8          3.4           1.6          0.2  Iris-setosa
12           4.8          3.0           1.4          0.1  Iris-setosa
13           4.3          3.0           1.1          0.1  Iris-setosa
14           5.8          4.0           1.2          0.2  Iris-setosa
15           5.7          4.4           1.5          0.4  Iris-setosa
16           5.4          3.9           1.3          0.4  Iris-setosa
17           5.1          3.5           1.4          0.3  Iris-setosa
18           5.7          3.8           1.7          0.3  Iris-setosa
19           5.1          3.8           1.5          0.3  Iris-setosa

3.3 Statistical summary

Now we can look at the summary of each property.

This includes counts, averages, minimums and maximums as well as some percentiles.

# descriptions
print(dataset.describe())

We can see that all of the values have the same unit (centimetres) and range between 0 and 8 cm.

       sepal-length  sepal-width  petal-length  petal-width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.054000      3.758667     1.198667
std        0.828066     0.433594      1.764420     0.763161
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000

3.4 Class distribution

Now let's take a look at the number of instances (rows) that belong to each class. We can view this as an absolute count.

# class distribution
print(dataset.groupby('class').size())

We can see that each class has the same number of instances (50 or 33% of the data set).

class
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
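If you also want to see the distribution as a fraction of the dataset rather than an absolute count, one optional extra line (not part of the original recipe) is:

# class distribution as a proportion of all rows (each class should show about 0.333)
print(dataset.groupby('class').size() / len(dataset))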

4

Data visualization

We now have a basic understanding of the data. We need to extend that with some visualizations.

We are going to look at two types of plots:

Univariate plots, to better understand each attribute.

Multivariate plots, to better understand the relationships between attributes.

4.1 Univariate plots

We start with some univariate plots, that is, plots of each individual variable.

Since the input variables are numeric, we can create a box plot for each input variable.

# box and whisker plots
dataset.plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False)
plt.show()

This gives us a much clearer idea of the distribution of the input attributes. We can also create a histogram of each input variable to get an idea of its distribution.

# histograms
dataset.hist()
plt.show()

It looks like perhaps two of the input variables have a Gaussian distribution. This is useful to note, as we can use algorithms that exploit this assumption.

4.2 Multivariate plots

Now we can look at the interaction between variables.

First, let's look at scatter plots of all pairs of attributes. This can be helpful for spotting structured relationships between input variables.

# scatter plot matrix
scatter_matrix(dataset)
plt.show()

Note the diagonal grouping of some pairs of attributes. This suggests a high correlation and a predictable relationship.

5

Evaluate some algorithms

Now it is time to create some models of the data and estimate their accuracy on unseen data.

Here's what we're going to discuss:

Separate out a validation dataset.

Set up the test harness to use 10-fold cross-validation.

Build 6 different models to predict species from the flower measurements.

Choose the best model.

5.1 Creating a Validation Data Set

We need to know whether the model we create is any good.

Later, we will use statistical methods to estimate the accuracy of the models we create on unseen data. We also want a more concrete estimate of the accuracy of the best model by evaluating it on data that is actually unseen.

That is, we will keep some data that the algorithm can't see, and we'll use that data to determine how accurate the model is.

We will divide the loaded data set into two parts, 80% of which will be used to train our model and 20% will be used as the validation data set.

# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
Y = array[:,4]
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)

You now have training data in X_train and Y_train for preparing models, and X_validation and Y_validation sets that we will use later.
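As an optional sanity check (not required for the tutorial), you can print the shapes of the splits; with 150 rows and a 20% validation size you should see roughly 120 training rows and 30 validation rows:

# confirm the 80/20 split
print(X_train.shape, Y_train.shape)
print(X_validation.shape, Y_validation.shape)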

5.2 Test harness

We will use 10-fold cross-validation to estimate accuracy.

This will split our dataset into 10 parts, train on 9 and test on 1, repeating for all combinations of train/test splits.

# Test options and evaluation metric
seed = 7
scoring = 'accuracy'

We are using the metric of "accuracy" to evaluate the models. This is the number of correctly predicted instances divided by the total number of instances in the dataset, multiplied by 100 to give a percentage (e.g. 95% accurate). We will use the scoring variable when we build and evaluate each model below.
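As a tiny illustration of what the accuracy metric means, here is a made-up example (purely for intuition, not part of the project): 3 of the 4 predictions below are correct, so accuracy is 0.75, or 75%.

# illustrative only: accuracy = correct predictions / total predictions
from sklearn.metrics import accuracy_score
y_true = ['a', 'a', 'b', 'b']
y_pred = ['a', 'b', 'b', 'b']
print(accuracy_score(y_true, y_pred))  # 0.75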

5.3 Building models

We don't know which algorithms will be good for this problem, or which configurations to use. From the plots, we get the idea that some of the classes are partially linearly separable in some dimensions, so we expect generally good results.

Let's evaluate 6 different algorithms:

Logistic Regression (LR)

Linear Discriminant Analysis (LDA)

K-Nearest Neighbors (KNN)

Classification and Regression Trees (CART)

Gaussian Naive Bayes (NB)

Support Vector Machines (SVM)

This is a good mix of simple linear (LR and LDA) and nonlinear (KNN, CART, NB and SVM) algorithms. We reset the random number seed before each run to ensure that each algorithm is evaluated using exactly the same data splits. This ensures the results are directly comparable.

Let's build and evaluate our six models:

# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
# evaluate each model in turn
results = []
names = []
for name, model in models:
    kfold = model_selection.KFold(n_splits=10, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)


5.4 Select the best model

We now have 6 models and accuracy estimates for each. We need to compare the models to each other and select the most accurate one.

Running the above example, we get the following raw results:

LR: 0.966667 (0.040825)
LDA: 0.975000 (0.038188)
KNN: 0.983333 (0.033333)
CART: 0.975000 (0.038188)
NB: 0.975000 (0.053359)
SVM: 0.981667 (0.025000)

We can see that KNN appears to have the largest estimated accuracy score.
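If you prefer to pick the best model programmatically rather than by eye, a small sketch (assuming the results and names lists populated by the loop above) could look like this:

# find the model with the highest mean cross-validation accuracy
best_index = max(range(len(results)), key=lambda i: results[i].mean())
print("Best model: %s (%.6f)" % (names[best_index], results[best_index].mean()))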

We can also create a plot of the model evaluation results and compare the spread and mean accuracy of each model. There is a population of accuracy measures for each algorithm because each algorithm was evaluated 10 times (10-fold cross-validation).

# Compare Algorithms
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()


You can see that the box and whisker plots are squashed at the top of the range, with many samples achieving 100% accuracy.

6

Make predictions

The KNN algorithm was the most accurate model we tested. Now we want to get an idea of the accuracy of the model on the validation set.

This gives us an independent final check on the accuracy of the best model. It is valuable to keep a validation set in case you made a slip during training, such as overfitting to the training set or a data leak. Both will lead to overly optimistic results.

We can run the KNN model directly on the validation set and summarize the results into final accuracy scores, confusion matrices, and classification reports.

# Make predictions on validation dataset
knn = KNeighborsClassifier()
knn.fit(X_train, Y_train)
predictions = knn.predict(X_validation)
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))


We can see that the accuracy is 0.9, or 90%. The confusion matrix provides an indication of the three errors made. Finally, the classification report provides a breakdown of each class by precision, recall, f1-score and support, showing excellent results (granted, the validation dataset was small).


0.9

[[ 7  0  0]
 [ 0 11  1]
 [ 0  2  9]]

                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00         7
Iris-versicolor       0.85      0.92      0.88        12
 Iris-virginica       0.90      0.82      0.86        11

    avg / total       0.90      0.90      0.90        30

Working through the tutorial above should take you only 5 to 10 minutes.

7

Summary

In this article, you discovered step by step how to complete your first machine learning project in Python.

You will find that completing a small end-to-end project, from loading the data through to making predictions, is the best way to become familiar with a new platform.
