Lecture 4: Analyzing Tabular Data with Pandas

Yes, it is. We can also use df.isna() to see whether df has any NaN values.
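A minimal sketch of the df.isna() check mentioned above. The toy frame and its column names are assumptions for illustration, not the course dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration.
df = pd.DataFrame({
    "new_cases": [10.0, np.nan, 25.0],
    "new_tests": [100.0, 200.0, np.nan],
})

# Element-wise mask: True wherever a value is missing.
mask = df.isna()

# Does the frame contain any NaN at all?
has_nan = bool(mask.any().any())

# Number of missing values in each column.
nan_per_column = mask.sum()
```

mask.any().any() collapses the mask to a single yes/no answer, while mask.sum() shows which columns the gaps are in.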

Incorrect; reset the answer.

There is, but it's already written in by the time you first open the notebook. It's in the first few lines.

If you’re running locally, you would not need to install every time. Since Binder is a temporary instance, you will need to install and then import each time.

Why is NA (which indicates that the data is simply missing) not used instead of NaN (“not a number”, which means there is a result but it cannot be represented) in the data frame?

9 posts were split to a new topic: How to deal with missing values (NaN)?

Yeah, it's correct: a null set.

cases_df = covid_df[covid_df['date'=={'2020-09-02'}], 'new_cases']

It seems I implemented it wrongly. Can you give me some hints on it? Thanks!
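One possible way to write that selection, as a sketch. It assumes the date column holds strings like '2020-09-02', and the toy values are made up. The comparison has to be applied to the column itself (covid_df['date'] == ...), and .loc accepts the row mask and the column label together.

```python
import pandas as pd

# Hypothetical stand-in for covid_df; the values are made up.
covid_df = pd.DataFrame({
    "date": ["2020-09-01", "2020-09-02", "2020-09-03"],
    "new_cases": [5, 7, 9],
})

# Build the boolean mask by comparing the column, not the column name.
is_sept_2 = covid_df["date"] == "2020-09-02"

# .loc takes the row mask and the column label together.
cases = covid_df.loc[is_sept_2, "new_cases"]
```

Here cases is a Series holding only the new_cases values for the matching rows.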

Why can't I use pip install Jovian in Kaggle?

Restart the kernel.

How can you retrieve the rows with specific data, for example the rows with deaths = 10 or deaths > 10?

covid_df.at[any rows, and deaths]

deaths = df[df['deaths'] > 10]
This returns a dataframe with all the rows which have deaths greater than 10.
Correct me if I am wrong!

Yeah, but it does not include 10. To include 10, use >=.
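A small sketch of the difference between >, >= and == when filtering rows. The column name deaths and the toy values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy frame; the values are made up.
df = pd.DataFrame({"deaths": [5, 10, 15]})

# Strictly greater than 10: the row with exactly 10 is left out.
more_than_10 = df[df["deaths"] > 10]

# Greater than or equal to 10: the row with exactly 10 is kept.
at_least_10 = df[df["deaths"] >= 10]

# Exactly 10: use ==, not a single =.
exactly_10 = df[df["deaths"] == 10]
```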

I used curly brackets to indicate that you have to insert a value there.

You don't need .at if you’re not selecting by index.
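To illustrate the point, a sketch contrasting .at with a plain boolean filter. The index labels and values below are made up for the example.

```python
import pandas as pd

# Hypothetical toy frame with a labelled index.
covid_df = pd.DataFrame(
    {"deaths": [5, 10, 15]},
    index=["2020-09-01", "2020-09-02", "2020-09-03"],
)

# .at reads exactly one cell, addressed by row label and column label.
one_cell = covid_df.at["2020-09-02", "deaths"]

# Filtering on values uses a boolean mask instead; no .at involved.
rows_over_10 = covid_df[covid_df["deaths"] > 10]
```

So .at fits when the row label is already known, and a condition inside the brackets fits when rows are picked by their values.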

To get the values from a specific column, select that column as well.

Happy Teachers' Day…

high_cases_df = covid_df[covid_df.new_cases > 1000]. It is interesting that we can just use . to index a specific column rather than df[df[col]]! Nice and fast. I wonder what happens if I want to filter on multiple columns with 2 different criteria:

high_ratio_df = covid_df[covid_df.new_cases >100 / covid_df.new_tests > positive_rate]

and this gives an error… It seems I need to turn covid_df.new_cases > 100 into a boolean mask and subset the df with it first… a bit inefficient, though.
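A sketch of one way to combine the two criteria. In the expression above, / binds more tightly than >, so Python reads it as new_cases > (100 / new_tests) > positive_rate, a chained comparison that needs the truth value of a whole Series and raises a ValueError. Writing each condition separately and joining them with & avoids that. The toy values and the positive_rate threshold below are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-in for covid_df; the values are made up.
covid_df = pd.DataFrame({
    "new_cases": [50.0, 200.0, 300.0],
    "new_tests": [1000.0, 1000.0, 10000.0],
})
positive_rate = 0.1  # hypothetical threshold

# Each criterion becomes its own boolean mask.
many_cases = covid_df.new_cases > 100
high_ratio = covid_df.new_cases / covid_df.new_tests > positive_rate

# & combines the masks element-wise (use | for "or").
high_ratio_df = covid_df[many_cases & high_ratio]
```

When the conditions are written inline, each one needs its own parentheses, e.g. covid_df[(covid_df.new_cases > 100) & (covid_df.new_cases / covid_df.new_tests > positive_rate)], because & also binds more tightly than >.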

It's better to use df[df[col]] than . since spaces in the column name would break the . syntax.

I'm confused now between .sort_values and the sorting function in numpy that you covered last week. How do we know which one we should use, since they serve the same purpose (or not, depending on the dataframe)?
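One way to see the difference, as a rough sketch with made-up data: .sort_values reorders whole rows of a dataframe by a column and keeps the index labels, while np.sort returns a sorted copy of a bare array of values with no rows or labels attached.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are made up.
df = pd.DataFrame({
    "location": ["a", "b", "c"],
    "new_cases": [30, 10, 20],
})

# Sort whole rows by one column; the other columns move along with it.
sorted_df = df.sort_values("new_cases", ascending=False)

# np.sort only sees the raw values and returns a sorted copy of them.
sorted_values = np.sort(df["new_cases"].to_numpy())
```

So for tabular data where the rows must stay together, .sort_values is usually the natural choice; np.sort fits plain numeric arrays.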

You can write the condition and use .index.
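A minimal sketch of that suggestion, assuming a toy frame with made-up index labels: filter with the condition, then read .index to get the labels of the matching rows.

```python
import pandas as pd

# Hypothetical toy frame with labelled rows.
df = pd.DataFrame({"deaths": [5, 10, 15]}, index=["x", "y", "z"])

# Filter by the condition, then take the index of the rows that matched.
matching_index = df[df["deaths"] > 10].index
```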