Machine learning

June 3, 2023

Abstract

Machine learning is widely used to solve business problems in modern times. Business analytics is a prominent topic today, as many companies rely on it to recover or to stay competitive.

Purpose of the project

The main purpose of this project is to analyze the sentiment of smartphone reviews in order to understand the problems customers face after purchasing a smartphone, and to analyze how the products can be improved based on those problems.

Approach

In this project, we use a sentiment analyzer to determine the sentiment of each review given by a customer.

Findings

The findings of the project are the main issues customers face after purchasing a smartphone.

Business implications

This project aims to increase customers’ willingness to purchase the products by focusing on product drawbacks that are not obvious. Resolving such issues may help increase the market value of the brand and the product.

Introduction

Sentiment analysis, also known as opinion mining, is a prominent topic in the field of machine learning. It helps us determine the emotion behind an opinion, an approach that has proved helpful in the business domain. Since 2000, sentiment analysis has become an area of interest for businesses because it helps improve marketing strategies and enables company owners to make efficient decisions. Social media is a rich source of customer opinions about the products sold, and these opinions help us understand customers’ main preferences.

This project focuses on machine learning algorithms that extract sentiment from the reviews given by customers. Machine learning and AI have been widely used to reduce human effort and increase prediction efficiency. Let us look at some machine learning approaches and how they can help solve our problem.

Machine learning systems can be classified according to the amount and type of supervision they get during training. There are four major categories: supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning.

Supervised learning

In supervised learning, the training data we feed to the algorithm includes the desired solutions, called labels. A typical supervised learning task is classification. The spam filter is a good example of this: it is trained with many example emails along with their class (spam or ham), and it must learn how to classify new emails.

Another typical task is to predict a target numeric value, such as the price of a car, given a set of features (mileage, age, brand, etc.) called predictors. This sort of task is called regression. To train the system, we need to give it many examples of cars, including both their predictors and their labels (i.e., their prices).

Some regression algorithms can be used for classification as well, and vice versa. For example, Logistic Regression is commonly used for classification, as it can output a value that corresponds to the probability of belonging to a given class (e.g., 20% chance of being spam).
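As a rough illustration of how Logistic Regression turns a score into a class probability, here is a minimal sketch in plain Python. The function name, the two features, the weights and the bias are all made up for the example; a trained model would learn the weights from labeled emails.

```python
import math


def predict_spam_probability(features, weights, bias):
    """Return a probability between 0 and 1 for one example (hypothetical helper)."""
    # Linear score: weighted sum of the features plus a bias term.
    score = sum(w * x for w, x in zip(weights, features)) + bias
    # The logistic (sigmoid) function squashes the score into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-score))


# Assumption: made-up weights for two made-up features of an email.
weights = [1.2, 0.8]
bias = -2.0

probability = predict_spam_probability([1.0, 0.0], weights, bias)
# A common convention is to predict the class whose probability is at least 0.5.
label = "spam" if probability >= 0.5 else "ham"
```

The output is a probability rather than a hard label, which is why the same model can serve as a classifier once a cut-off is chosen.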

Here are some of the most important supervised learning algorithms:

k-Nearest Neighbors

Linear Regression

Logistic Regression

Support Vector Machines (SVMs)

Decision Trees and Random Forests

Neural networks

Unsupervised learning

In unsupervised learning, as we might guess, the training data is unlabeled. The system tries to learn without a teacher.

Here are some of the most important unsupervised learning algorithms:

Clustering: k-Means, Hierarchical Cluster Analysis (HCA), Expectation Maximization

Visualization and dimensionality reduction: Principal Component Analysis (PCA), Kernel PCA, Locally-Linear Embedding (LLE), t-distributed Stochastic Neighbor Embedding (t-SNE)

Association rule learning

Semi-supervised learning

Some algorithms can deal with partially labeled training data, usually a lot of unlabeled data and a little bit of labeled data. This is called semi-supervised learning.

Most semi-supervised learning algorithms are combinations of unsupervised and supervised algorithms. For example, deep belief networks (DBNs) are based on unsupervised components called restricted Boltzmann machines (RBMs) stacked on top of one another. RBMs are trained sequentially in an unsupervised manner, and then the whole system is fine-tuned using supervised learning techniques.

Reinforcement Learning

Reinforcement Learning is a very different beast. The learning system, called an agent in this context, can observe the environment, select and perform actions, and get rewards in return (or penalties in the form of negative rewards). It must then learn by itself what is the best strategy, called a policy, to get the most reward over time. A policy defines what action the agent should choose when it is in a given situation.

This project mainly takes a supervised approach, in which we use a sentiment analyzer to predict the sentiment of the reviews. The sentiment of a review can be one of three types: positive, neutral or negative. We focus only on the positive and negative reviews written by customers after purchasing a product.

The data consist of Samsung smartphone reviews written by customers after purchasing different smartphone models. The data undergo text preprocessing, since cleaning the data is crucial before any analysis.

Some of the preprocessing techniques are given below.

Lower casing of data

Initially the reviews are in both lower and upper case; we convert all the reviews to lower case.

Punctuation Removal

After converting the reviews to lower case, we remove all punctuation from them.

Missing Value Imputation

No imputation or dropping of attributes is required as our data do not contain any missing values.

Stopwords Removal

Stopwords are very common words, such as ‘the’, ‘is’ and ‘and’, that occur throughout a text body and carry little meaning on their own. We remove them before the analysis.
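The preprocessing steps above can be sketched in a few lines of standard-library Python. This is only an illustration: the function name, the sample review and the very short stopword list are assumptions made for the example, and a real project would use a fuller stopword list from a text-processing library.

```python
import string

# Assumption: a tiny illustrative stopword list, not the one used in the project.
STOPWORDS = {"the", "is", "a", "an", "and", "but", "it", "this", "to", "of"}


def preprocess(review):
    """Lower-case a review, strip punctuation, and drop stopwords."""
    # Step 1: lower casing.
    lowered = review.lower()
    # Step 2: punctuation removal (every character in string.punctuation is deleted).
    no_punctuation = lowered.translate(str.maketrans("", "", string.punctuation))
    # Step 3: split on whitespace and remove stopwords.
    return [token for token in no_punctuation.split() if token not in STOPWORDS]


tokens = preprocess("The battery is GREAT, but the screen is slow!")
# tokens == ['battery', 'great', 'screen', 'slow']
```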

Graphical inference of the data

A picture is worth a thousand words. Graphical analysis is an important step in machine learning, as it provides a clear and precise understanding of the insights drawn from the data.

Here we perform two important graphical analyses.

Word Cloud

A word cloud is a visual representation of words in which the size of each word tells us how frequently the word appears in a document.
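The quantity behind a word cloud is simply a word count. The sketch below computes those counts with the standard library; the function name and the two sample reviews are made up for illustration, and drawing the actual cloud would need a plotting tool on top of these counts.

```python
from collections import Counter


def word_frequencies(reviews):
    """Count how often each word occurs across a list of review strings."""
    counts = Counter()
    for review in reviews:
        counts.update(review.lower().split())
    return counts


# Assumption: toy reviews, not taken from the project data.
sample_reviews = ["great phone great battery", "battery drains fast"]
frequencies = word_frequencies(sample_reviews)

# The most frequent words would be drawn largest in the cloud.
top_words = frequencies.most_common(2)
```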

N-Gram analysis

An n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus.
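For word-level n-grams, the definition above amounts to sliding a window of n words over the text. A minimal sketch follows; the helper names `ngrams` and `top_ngrams` are hypothetical and the sample sentence is only an example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous runs of n tokens, in order of appearance."""
    # If there are fewer than n tokens, the range is empty and so is the result.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(token_lists, n, k):
    """Return the k most frequent n-grams across several tokenized reviews."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(ngrams(tokens, n))
    return counts.most_common(k)


tokens = "great phone great price".split()
trigrams = ngrams(tokens, 3)
# trigrams == [('great', 'phone', 'great'), ('phone', 'great', 'price')]
```

Setting n to 1, 2 or 3 gives unigrams, bigrams or trigrams from the same function.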

Discussion of Data source

The phone review data were collected from Kaggle, which is regarded as one of the largest data hubs for machine learning enthusiasts and researchers. The data contain more than 65,000 reviews of various smartphones, which helps us capture the opinions of different customers after they purchase the products.

Tools

Python

For data analysis and sentiment mining, we use Python as our main programming language. Python is preferred mainly because of its flexibility in handling text data and the availability of libraries that make interpretation and analysis easy while preprocessing the data.

Tableau

Tableau is regarded as one of the finest tools for data visualization and graphical analysis. It is chosen for its efficient drag-and-drop features, which save considerable effort when building intuitive dashboards and seamless storytelling.

Justification of Choice

For sentiment analysis, we chose the Vader sentiment analyzer as a good fit for this task. It computes polarity scores for each review and tells us whether the review is positive, negative or neutral.

Python provides many libraries for data manipulation and text mining, and we found it reliable and smooth to work with for this kind of analysis.

Tableau is chosen for visualizing the reviews, and it makes building dashboards and storytelling much easier. It helps us explain the data to senior management, and business problems can be interpreted easily through it.

Key Results

After importing the data, we cleaned it. We observed that 5-star reviews, the highest rating, account for more than 35,000 of the reviews, while reviews with fewer than 3 stars are comparatively rare. We ran sentiment analysis on all the reviews to identify the most positive and the most negative ones. To clean the text, we converted it to lower case, removed the punctuation, and removed the stop words. Stop words heavily influence the data in sentiment analysis, so we removed them before applying the sentiment analyzer to the reviews. We used the Vader sentiment intensity analyzer for the sentiment analysis.

The Vader sentiment analyzer relies on a lexicon, that is, a list of features such as words, each labelled according to its orientation as positive or negative. The analyzer returns scores that indicate how positive or negative a text is. It also returns a compound score, a value normalized to the range -1 to +1. Conventionally, a compound score of 0.05 or more is termed a positive sentiment, a score between -0.05 and 0.05 is termed a neutral sentiment, and a score of -0.05 or less is termed a negative sentiment. To be safe, we used a single threshold of 0: anything above zero we considered a positive sentiment and anything below zero a negative sentiment. We did not consider neutral sentiment, as it is not an important measure for understanding which business steps we have to take to improve the products.
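The two labelling rules described above can be sketched as plain functions of the compound score. The function names are hypothetical, and the scores are passed in as ordinary numbers so the example runs without any sentiment package; in practice the compound score would come from the analyzer's polarity scores. How a score of exactly 0 was treated is not stated here, so leaving it unlabelled is an assumption of this sketch.

```python
def label_three_way(compound, cutoff=0.05):
    """Conventional three-way labelling of a compound score in [-1, 1]."""
    if compound >= cutoff:
        return "positive"
    if compound <= -cutoff:
        return "negative"
    return "neutral"


def label_two_way(compound, threshold=0.0):
    """Two-way labelling with a single threshold, as used in this project."""
    if compound > threshold:
        return "positive"
    if compound < threshold:
        return "negative"
    # Assumption: a score exactly at the threshold is left unlabelled.
    return None


# A weakly positive score is neutral under the first rule, positive under the second.
weak_score = 0.01
three_way = label_three_way(weak_score)
two_way = label_two_way(weak_score)
```

Keeping the threshold as a parameter makes it easy to try other cut-offs and compare the resulting counts of positive and negative reviews.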

After taking the threshold value of 0, we obtained the following numbers of positive and negative reviews.

As we can see, most of the reviews are positive: only 7,422 of around 65,000 reviews are negative.

From this analysis, it is clear that most customers are happy after purchasing Samsung smartphones.

Visualization Results

Now let us perform n-gram analysis on our positive and negative reviews.

N-gram analysis on Positive reviews

From the positive reviews, we can see that some phones are especially loved by customers:

Samsung Galaxy Note 3

Samsung Galaxy Note 4

Samsung Galaxy S5 Mini

Samsung Galaxy S6 Edge

Samsung Galaxy S7 edge

N-gram analysis on negative reviews

As we can see, most people are unhappy with problems such as network bands, which are mentioned frequently in the negative reviews.

Now let us look into the word clouds of the phones.

Let us see the word cloud of the phones that are rated more than 3 stars.

Most of the phone reviews contain words such as ‘good’, ‘lasting’, ‘great’ and ‘nice’, which suggests that customers are pleased with these phones. Words such as ‘battery’ and ‘slow’ also appear, which suggests that these phones are sometimes slow in use and may or may not have good battery backup. We test this by looking at the phones rated less than 3 stars.

In this word cloud, words like ‘problems’, ‘responsive’, ‘battery’ and ‘charging’ appear frequently, which indicates that these phones are heavily criticized for battery backup and responsiveness.

Business Story Telling

Let us compare the positive and negative trigrams side by side to draw some conclusions.

The most frequent n-gram in the positive reviews is ‘great phone great price’.

The most frequent n-gram in the negative reviews is ‘warning warning warning’.

Based on both graphical analyses, we can take the following steps to improve the performance of the products.

Network Upgrade

Most customers are not happy with the network bands available in the smartphones. Upgrading network support could slightly increase sales of the products and win back dissatisfied customers.

Battery Issues

Most customers need long-lasting batteries in their phones. Samsung targets middle- to high-income customers, which suggests that these customers work long hours and need their phones to keep working for calls and other activities.

Battery life is therefore a major concern to address in order to increase demand for the smartphones.

Software Issues

In the trigrams, the most frequent word is ‘warning’, which suggests that many customers are experiencing software issues.

Upgrading the software on a regular basis gives customers a smooth experience, which is one of the prime factors in successful marketing campaigns.

Touch Issues

Many customers faced touch screen issues, which is another factor behind bad reviews and ratings. Touch screen issues can be addressed by upgrading display quality and fixing major software issues.

Conclusion

The Vader sentiment analyzer proves to be an efficient analyzer, as most of the reviews could be segmented into positive and negative. It is interesting to see that most of the products with low ratings fall into positive sentiment, while most of the highly rated products fall into negative sentiment. The graphical analysis shows that most customers face battery and network issues in most of the low-rated smartphones. Although most of the phones seem to have a positive sentiment, some phones are rated low because of various issues. Business campaigns can focus mainly on the issues mentioned in the storytelling section in order to improve sales and demand for the smartphones.

The word clouds of both high- and low-rated smartphones show that the major concerns lie in battery and software issues, and improving these could increase sales of the products. The n-gram analysis can also be altered to look at the unigrams or bigrams of the reviews. In this project, unigrams and bigrams did not give meaningful insights, which is why trigrams were chosen to capture customers’ opinions.

Text preprocessing is done to help the sentiment analyzer make a good analysis and produce a good compound score. Other preprocessing steps, such as lemmatization, stemming and POS tagging, could also be applied, but such methods are mainly significant when carrying out text classification.

Using the sentiment analyzer’s output or the rating as the target variable, text classification is also possible in this model: we could predict the rating of future reviews written by customers after purchasing a product. Since our objective is to extract sentiment from the reviews, we did not make classification the main aim of the project.

Project Limitations

Despite the flexible assumptions made by the analyzer, the project has certain limitations.

This analysis can be done only on products that customers have already purchased. The data should therefore be updated regularly so that the insights support up-to-date storytelling.

The sentiment analyzer also assigned positive sentiment to some low-rated reviews, and vice versa. To address this, other sentiment analysis methods can be tried and the performance of the analyzers compared.

New issues faced by customers should be added to the data, as they can change the sentiment and allow different insights to be drawn.

The data contain different types of phones, which can be problematic because most modern smartphones have a touch screen rather than a keypad. The sentiment of more recent reviews can therefore tell us more for business campaigns.

Different software upgrades on the smartphones can introduce different issues, which can change the sentiment analysis and should be taken into account.

Even after solving all these issues, it can still be challenging to predict the market demand for products, because different problems arise at different times for different customers. This is why regularly updating reviews and feedback is necessary.
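One way to put a number on the disagreement between star ratings and sentiment labels mentioned among the limitations is a simple mismatch rate. The sketch below is illustrative only: the helper names and the toy records are invented, and treating more than 3 stars as positive and fewer than 3 as negative mirrors the split used for the word clouds rather than any rule fixed by the project.

```python
def rating_label(stars):
    """Map a star rating to a label; 3-star reviews are skipped (assumption)."""
    if stars > 3:
        return "positive"
    if stars < 3:
        return "negative"
    return None


def mismatch_rate(records):
    """Share of reviews whose sentiment label disagrees with their rating label.

    records is an iterable of (stars, sentiment_label) pairs.
    """
    compared = 0
    mismatched = 0
    for stars, sentiment in records:
        expected = rating_label(stars)
        if expected is None:
            continue  # no rating-based label to compare against
        compared += 1
        if expected != sentiment:
            mismatched += 1
    return mismatched / compared if compared else 0.0


# Assumption: toy records, not taken from the project data.
toy_records = [
    (5, "positive"),
    (1, "positive"),
    (4, "negative"),
    (3, "positive"),
    (2, "negative"),
]
rate = mismatch_rate(toy_records)  # 2 of the 4 comparable reviews disagree
```

Computing the same rate for a second analyzer would give a like-for-like comparison of the two.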

Recommendations

Updating the data is a necessary step in machine learning, as outdated data often leads to problems such as overfitting and underfitting. Reviews play an important part in modern marketing, and Samsung releases new products at short intervals. Reviews of the latest products should therefore be added to the data, and marketing campaigns should take into account the problems faced by customers.

Also, the phones with low ratings can be removed so that sentiment analysis is done only on the highly rated phones. This step is recommended because it lets us focus on the highly rated phones, and solving the problems of those phones can increase their demand in the market.

Changing the threshold value that defines a sentiment as positive or negative is recommended, as it may improve the quality of the evaluation of the reviews customers give for a particular product.

Streaming live data from an online source is advisable, as it can give a glimpse of reviews of products that have been released or are about to be released.

Relying only on the review text is not advisable, as other parameters can also tell us about the opinions of customers buying a particular product.

In such cases, other parameters such as the votes on the reviews, the time of the reviews and the rating of the product can be taken into account, as these can help draw out customers’ opinions.

