I am doing sentiment analysis and I want to count the number of unique values of my labels. When I use value_counts(), I get multiple entries for the same class of labels.
The result of the command above is given below.
Name: sentiment, dtype: int64
I don’t understand why I am getting two entries each for the negative and neutral classes.
The sentiment column is the label for the dataset. The labels are either negative, positive or neutral. When I apply
nunique() methods I get
negative, positive or neutral. negative, neutral and
5 respectively as the answer. When I use
count() method, I get the result shown below.
I still have no idea what the type of the “sentiment” column is.
Would be great to know what dataset it is.
What is the “Unnamed: 0” in your result using
How does your dataframe look when you
display() it BTW?
I am working on sentiment analysis of tweets. The tweets are labelled as negative, positive or neutral. “Unnamed: 0” is pandas’ way of naming a column whose header is empty. It contains a count of the rows. I have not tried
I’m not asking what the “sentiment” column represents, or how it relates to anything in this dataset. I want to know its datatype.
Could you share a link to this dataset? There are a few such datasets on Kaggle, and probably even more if you dig deep enough. Having a look at it might help me understand what’s going on.
It’s an object data type, as shown below using
Unfortunately, I can’t share the dataset because of the ethical clearance on the dataset from my university. I can give you the result of any command you want me to execute. For example
display() gives this error
AttributeError: 'DataFrame' object has no attribute 'print'
AttributeError: 'DataFrame' object has no attribute 'display'
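For anyone hitting the same two errors: they most likely come from calling print and display as DataFrame methods, when both are functions that take the frame as an argument. A minimal sketch, assuming a made-up three-row frame in place of the real tweets data:

```python
import pandas as pd

# Hypothetical stand-in for the real dataframe, which is not shown in this thread.
df = pd.DataFrame({"sentiment": ["negative", "positive", "neutral"]})

# Neither name exists as a DataFrame attribute, which is why
# df.print() and df.display() raise AttributeError.
has_print_method = hasattr(df, "print")
has_display_method = hasattr(df, "display")

# Pass the frame to the built-in print() instead.
print(df.head())

# to_string() gives the same plain-text rendering as a str.
# In a Jupyter notebook, display(df) follows the same pattern: the frame is the argument.
text = df.to_string()
```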
I have been able to solve the problem by using find and replace in Google Sheets. This ensures that there’s no space before or after the labels. In other words, I checked whether replacing the text with numbers would eliminate the problem, and it did. For example, I used 1, 2 and 3 to replace positive, negative and neutral respectively. I tested value_counts() before reversing the operation. Then I replaced 1, 2 and 3 with positive, negative and neutral respectively.
I suspected that some labels had an additional character that distinguished them from the others, but I had to be sure about the type (if it was
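The same cleanup can probably be done inside pandas, without a round trip through Google Sheets. A rough sketch, assuming the only difference between the duplicate labels is leading or trailing whitespace; the five-row frame below is invented for illustration and is not the real dataset:

```python
import pandas as pd

# Hypothetical data: two labels carry a stray space, so pandas treats them as separate classes.
df = pd.DataFrame(
    {"sentiment": ["negative", "negative ", "neutral", " neutral", "positive"]}
)

# Before cleaning: each whitespace variant is counted as its own label.
before = df["sentiment"].value_counts()

# Strip leading and trailing whitespace from every label.
df["sentiment"] = df["sentiment"].str.strip()

# After cleaning: the variants collapse into the three expected classes.
after = df["sentiment"].value_counts()

print(before)
print(after)
```

If the counts still do not collapse after this, the extra character is probably something other than whitespace (different casing, for instance), and the labels would need a closer look.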
pd.Categorical then my theory fails).
object means more or less that this is just a string. Anyway, good to know that you’ve found a way around it (I suspect that replacing the labels with numbers changed the cell type to number and removed the spaces).
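One way to confirm the extra-character theory without sharing the data is to print the repr of each unique label, since the quotes make stray spaces visible. A small sketch, with an invented Series standing in for the real sentiment column:

```python
import pandas as pd

# Hypothetical labels; the real column is not available in this thread.
labels = pd.Series(
    ["negative", "negative ", "neutral", " neutral", "positive"],
    name="sentiment",
)

# repr() wraps each value in quotes, so leading/trailing spaces show up clearly.
for value in labels.unique():
    print(repr(value))

# Collect only the labels that change when whitespace is stripped.
suspicious = [repr(value) for value in labels.unique() if value != value.strip()]
print(suspicious)
```

An empty `suspicious` list would point away from whitespace and towards some other difference between the labels.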