What are the top 5 most common jobs pandas in 2024

When it comes to data science or data analysis, Python is pretty much always the language of choice. Its library Pandas is the one you cannot, and more importantly, shouldn’t avoid.

Main content

Dataset used
1. Head and Tail
2. DataFrame.info( )
4. Shape and Size
5. describe( )
7. Identifying Missing Values Isnull
9. Identifying Missing Values Sum
10. Nunique
11. Index and columns
12. Memory_usage
13. nsmallest() and nlargest()
14. Loc and iloc
16. Groupby
Before you go
What are pandas used for?
What is the most common category pandas?
How to get top 10 values in pandas?
Which companies use pandas?

Pandas is a widely used Python data analysis library. It provides many functions and methods to speed up the data analysis process. What makes pandas so common is its functionality, flexibility, and simple syntax.

While Pandas by itself isn’t that difficult to learn, mainly due to the self-explanatory method names, having a cheat sheet is still worthwhile, especially if you want to code something quickly. That’s why today I want to focus on how I use Pandas for Exploratory Data Analysis, with a list of my most used methods and a detailed explanation of each.

Dataset used

I will run the examples on the California housing dataset. The dataset and code are available at this link:

https://github.com/jbolla368/Pandas-using-in-EDA.git

Here’s how to import the Pandas library and load in the dataset:
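A minimal sketch of the import and load step. The original screenshot is not available, so the inline CSV text below is only a stand-in for the file in the repository, and its column names and values are assumptions for illustration.

```python
import io

import pandas as pd

# Stand-in for the CSV file in the linked repository. The column names and
# values are assumed for illustration and are not the real data.
csv_text = """longitude,latitude,housing_median_age,total_rooms,total_bedrooms,median_house_value,ocean_proximity
-122.23,37.88,41,880,129,452600,NEAR BAY
-122.22,37.86,21,7099,1106,358500,NEAR BAY
-118.30,34.05,30,1500,,210000,INLAND
"""

# With a real file you would pass its path instead,
# e.g. pd.read_csv("housing.csv") (hypothetical file name).
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```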


1. Head and Tail

Once we read a dataset into a pandas data frame, we want to take a look at it to get an overview. The simplest way is to display some rows. Head and tail allow us to display rows from the top or bottom of the data frame, respectively.

Five rows are displayed by default, but we can change that by passing the number of rows we’d like to display.
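As a quick illustration of head and tail, here is a sketch on a small made-up frame (the column names and values are assumptions, not the real dataset):

```python
import pandas as pd

# Hypothetical ten-row frame standing in for the housing data
df = pd.DataFrame({
    "housing_median_age": range(10, 20),
    "median_house_value": range(100_000, 200_000, 10_000),
})

first_rows = df.head()   # first 5 rows by default
last_rows = df.tail(3)   # last 3 rows

print(first_rows)
print(last_rows)
```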

2. DataFrame.info( )

The pandas dataframe.info() function gives a concise summary of the dataframe: the index range, the column names, the non-null count and data type of each column, and the memory usage. It comes in really handy for a quick overview during exploratory analysis.
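A small sketch of calling info() on an assumed three-row frame with one missing value:

```python
import pandas as pd

# Hypothetical frame; column names and values are assumed for illustration
df = pd.DataFrame({
    "total_rooms": [880, 7099, 1500],
    "total_bedrooms": [129.0, 1106.0, None],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND"],
})

# Prints the index range, each column's non-null count and dtype,
# and the memory usage. The method prints its report and returns None.
df.info()
```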

3. Dtypes

We need to have the values stored in an appropriate data type. Otherwise, we may encounter errors. For large datasets, memory usage is greatly affected by choosing the correct data type. For example, the “category” data type is more appropriate than the “object” data type for categorical data, especially when the number of categories is much smaller than the number of rows.

The dtypes attribute shows the data type of each column.
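The following sketch, on an assumed two-column frame, shows dtypes before and after converting a text column to the category type:

```python
import pandas as pd

# Hypothetical frame; column names and values are assumed
df = pd.DataFrame({
    "median_income": [8.3, 7.2, 3.1, 2.5],
    "ocean_proximity": ["NEAR BAY", "INLAND", "INLAND", "NEAR BAY"],
})

print(df.dtypes)  # one dtype per column

# Convert the low-cardinality text column to the categorical dtype
df["ocean_proximity"] = df["ocean_proximity"].astype("category")
print(df.dtypes)
```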

4. Shape and Size

Shape can be used on numpy arrays, pandas series and dataframes. It shows the number of dimensions as well as the size in each dimension.

Since dataframes are two-dimensional, what shape returns is the number of rows and columns. It is a measure of how much data we have and a key input to the data analysis process.

Furthermore, the ratio of rows to columns is very important when designing and implementing a machine learning model. If we do not have enough observations (rows) with respect to features (columns), we may need to apply some pre-processing techniques such as dimensionality reduction or feature extraction.

Size, as the name suggests, returns the size of a dataframe which is the number of rows multiplied by the number of columns.
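A short sketch of shape and size on an assumed frame with three rows and two columns:

```python
import pandas as pd

# Hypothetical frame; column names and values are assumed
df = pd.DataFrame({
    "total_rooms": [880, 7099, 1500],
    "households": [126, 1138, 177],
})

print(df.shape)                 # (rows, columns) of the dataframe
print(df["total_rooms"].shape)  # a Series has a single dimension
print(df.size)                  # rows multiplied by columns
```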

5. describe( )

If there’s one thing you do over and over again in the process of exploratory data analysis — that’s performing a statistical summary for every (or almost every) attribute.

It would be a quite tedious process without the right tools — but thankfully Pandas is here to do the heavy lifting for you. The describe() method will do a quick statistical summary for every numerical column, as shown below:

Now I’m using the transpose operator to switch from columns to rows, and vice-versa.
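Here is a sketch of both calls on a small assumed frame. The text column is left out of the summary because describe() covers only numerical columns by default when the frame has any.

```python
import pandas as pd

# Hypothetical frame; column names and values are assumed
df = pd.DataFrame({
    "total_rooms": [100, 200, 300, 400],
    "median_income": [1.0, 2.0, 3.0, 4.0],
    "ocean_proximity": ["INLAND", "INLAND", "NEAR BAY", "NEAR BAY"],
})

summary = df.describe()       # statistics as rows, columns as columns
summary_t = df.describe().T   # transposed: one row per numerical column

print(summary)
print(summary_t)
```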

6. Sample

The sample method allows you to select rows or values randomly from a Series or DataFrame. It is useful when we want to draw a random sample from the data.
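A minimal sketch of sample() on an assumed 100-row frame. The random_state value is arbitrary and is only there to make the draw repeatable.

```python
import pandas as pd

# Hypothetical 100-row frame
df = pd.DataFrame({"median_house_value": range(100)})

five_rows = df.sample(n=5, random_state=42)         # a fixed number of rows
ten_percent = df.sample(frac=0.1, random_state=42)  # a fraction of the rows

print(five_rows)
print(ten_percent)
```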

7. Identifying Missing Values Isnull

Handling missing values is a critical step to build a robust data analysis process. The missing values should be a top priority since they have a significant effect on the accuracy of any analysis.

8. Isna

The isna function returns a dataframe of boolean values, with True indicating a missing value.
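The sketch below applies isna() to an assumed frame with two missing entries. In pandas, isnull() is an alias of isna(), so both calls give the same result.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column
df = pd.DataFrame({
    "total_bedrooms": [129.0, np.nan, 190.0],
    "ocean_proximity": ["NEAR BAY", "INLAND", None],
})

mask = df.isna()          # True where a value is missing
same_mask = df.isnull()   # alias of isna()

print(mask)
```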

9. Identifying Missing Values Sum

Let’s count the missing values in the data frame. You can calculate the number of missing values in each column with df.isnull().sum(); summing that result once more gives the total for the whole data frame.
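A sketch of the per-column and total counts on an assumed frame with three missing entries:

```python
import numpy as np
import pandas as pd

# Hypothetical frame; column names and values are assumed
df = pd.DataFrame({
    "total_rooms": [880, 7099, 1500],
    "total_bedrooms": [129.0, np.nan, np.nan],
    "ocean_proximity": ["NEAR BAY", None, "INLAND"],
})

missing_per_column = df.isnull().sum()    # one count per column
total_missing = missing_per_column.sum()  # single number for the whole frame

print(missing_per_column)
print(total_missing)
```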

10. Nunique

Nunique counts the number of unique entries over columns or rows. It is very useful for categorical features, especially in cases where we do not know the number of categories beforehand. Let’s take a look at our initial dataframe:
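A small sketch of nunique() on an assumed frame, once for the whole frame and once for a single column:

```python
import pandas as pd

# Hypothetical frame; column names and values are assumed
df = pd.DataFrame({
    "ocean_proximity": ["NEAR BAY", "INLAND", "INLAND", "NEAR OCEAN"],
    "housing_median_age": [41, 21, 21, 52],
})

per_column = df.nunique()                       # unique count for each column
n_categories = df["ocean_proximity"].nunique()  # unique count for one column

print(per_column)
print(n_categories)
```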

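11. Index and columns

In pandas, index and columns are attributes of a DataFrame rather than methods, so they are written without parentheses. df.index returns the row labels. (This is different from index(), the built-in Python list method, which searches for a given element from the start of the list and returns the lowest index where the element appears.)

Columns

df.columns returns the column labels.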
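A brief sketch of reading the row and column labels from an assumed frame with the default integer index:

```python
import pandas as pd

# Hypothetical frame; column names and values are assumed
df = pd.DataFrame({
    "longitude": [-122.23, -122.22, -118.30],
    "latitude": [37.88, 37.86, 34.05],
})

row_labels = df.index        # default RangeIndex starting at 0
column_labels = df.columns   # Index holding the column names

print(row_labels)
print(list(column_labels))
```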

12. Memory_usage

memory_usage() returns how much memory each column uses, in bytes. It is especially useful when we work with large dataframes. Consider the following dataframe with 1 million rows.
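The original dataframe is not available, so this sketch builds an assumed one with 1 million rows and two integer columns of different widths to show how the data type drives the byte count:

```python
import numpy as np
import pandas as pd

n = 1_000_000

# Hypothetical frame: one 8-byte integer column and one 1-byte integer column
df = pd.DataFrame({
    "population": np.arange(n, dtype="int64"),
    "housing_median_age": np.ones(n, dtype="int8"),
})

mem = df.memory_usage()  # bytes used by the index and by each column
print(mem)
```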

13. nsmallest() and nlargest()

I’m guessing there’s no doubt about the purposes of these two methods after just reading their names, but nevertheless, they can prove useful in the process of exploratory data analysis.

Let’s see how we’d go about finding the 5 observations with the smallest value.

Let’s see how we’d go about finding the 5 observations with the largest value.
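Both calls can be sketched on an assumed eight-row frame. Using median_house_value as the ranking column is an assumption, since the original screenshots are not available.

```python
import pandas as pd

# Hypothetical values; the ranking column is assumed
df = pd.DataFrame({
    "median_house_value": [452600, 358500, 352100, 341300,
                           342200, 269700, 299200, 241400],
})

smallest = df.nsmallest(5, "median_house_value")  # sorted ascending
largest = df.nlargest(5, "median_house_value")    # sorted descending

print(smallest)
print(largest)
```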

14. Loc and iloc

Loc and iloc are used to select rows and columns.

loc: select by labels
iloc: select by positions

loc is used to select data by label. The labels of columns are the column names. We need to be careful about row labels. If we do not assign any specific index, pandas creates an integer index by default. Thus, the row labels are integers starting from 0 and going up. The row positions that are used with iloc are also integers starting from 0.

Selecting first 6 rows and 2 columns with loc:

Selecting first 4 rows and first 6 columns with iloc:
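The two selections above can be sketched as follows, on an assumed frame with 8 rows and 7 columns. Note that a loc slice includes its end label, while an iloc slice excludes its end position.

```python
import numpy as np
import pandas as pd

# Hypothetical 8x7 frame; the column names are assumed and the values are dummies
columns = ["longitude", "latitude", "housing_median_age", "total_rooms",
           "total_bedrooms", "population", "households"]
df = pd.DataFrame(np.arange(56).reshape(8, 7), columns=columns)

# loc is label-based and inclusive, so :5 returns the rows labelled 0 through 5
first_six = df.loc[:5, ["longitude", "latitude"]]

# iloc is position-based and exclusive, so :4 returns positions 0 through 3
first_four = df.iloc[:4, :6]

print(first_six)
print(first_four)
```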

15. Slicing

Slicing Rows and Columns using labels. You can select a range of rows or columns using labels or by position. To slice..
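As a sketch under the same assumptions as before (made-up frame, default integer index), a range of rows and columns can be sliced by label with loc or by position with iloc:

```python
import numpy as np
import pandas as pd

# Hypothetical 6x4 frame; the column names are assumed and the values are dummies
columns = ["longitude", "latitude", "housing_median_age", "total_rooms"]
df = pd.DataFrame(np.arange(24).reshape(6, 4), columns=columns)

# Label slices include both ends: rows 2, 3, 4 and three columns
by_label = df.loc[2:4, "latitude":"total_rooms"]

# Position slices exclude the end: rows 2, 3 and columns 1, 2
by_position = df.iloc[2:4, 1:3]

print(by_label)
print(by_position)
```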

16. Groupby

The pandas groupby function is a great tool for exploring data. It makes it easier to unveil the underlying relationships among variables. The figure below shows an overview of what the groupby function does.
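A minimal sketch of groupby on an assumed frame: rows are split by a category column, an aggregate is computed for each group, and the results are combined into one object.

```python
import pandas as pd

# Hypothetical frame; column names and values are assumed
df = pd.DataFrame({
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "NEAR BAY"],
    "median_house_value": [100000, 400000, 200000, 500000],
})

# Mean house value for each category
mean_value = df.groupby("ocean_proximity")["median_house_value"].mean()

# Several aggregates at once
summary = df.groupby("ocean_proximity")["median_house_value"].agg(["mean", "count"])

print(mean_value)
print(summary)
```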

17. Sorting

A DataFrame can be sorted with the sort_index() method by passing the axis argument and the sort order. By default, sorting is done on row labels in ascending order. The order can be controlled by passing a Boolean value to the ascending parameter.

Sorting by values uses the sort_values() method instead.
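Here is a sketch of both kinds of sorting on an assumed frame whose index is deliberately out of order:

```python
import pandas as pd

# Hypothetical frame with a shuffled index
df = pd.DataFrame(
    {"median_house_value": [358500, 452600, 241400]},
    index=[2, 0, 1],
)

by_index = df.sort_index()                      # row labels, ascending
by_index_desc = df.sort_index(ascending=False)  # row labels, descending
by_value = df.sort_values("median_house_value", ascending=False)

print(by_index)
print(by_value)
```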

18. Dropna

The dropna() function is used to remove rows or columns that contain NaN (missing) values from a dataframe.
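A short sketch of dropping rows and dropping columns on an assumed frame with one missing value. dropna() returns a new frame and leaves the original unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; column names and values are assumed
df = pd.DataFrame({
    "total_rooms": [880, 7099, 1500],
    "total_bedrooms": [129.0, np.nan, 190.0],
})

rows_dropped = df.dropna()           # drop rows that contain a missing value
cols_dropped = df.dropna(axis=1)     # drop columns that contain a missing value

print(rows_dropped)
print(cols_dropped)
```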

19. Query

We sometimes need to filter a dataframe based on a condition or apply a mask to get certain values. One easy way to filter a dataframe is the query function. Let’s first create a sample dataframe.
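The original sample dataframe is not available, so the one below is assumed. The sketch filters it with query() and then with an equivalent boolean mask for comparison.

```python
import pandas as pd

# Hypothetical sample frame; column names and values are assumed
df = pd.DataFrame({
    "median_income": [8.3, 7.2, 3.1, 5.6],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "NEAR BAY"],
})

# The condition is written as a string that refers to column names
high_income_bay = df.query("median_income > 5 and ocean_proximity == 'NEAR BAY'")

# The same filter expressed as a boolean mask
mask = (df["median_income"] > 5) & (df["ocean_proximity"] == "NEAR BAY")
same_result = df[mask]

print(high_income_bay)
```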

20. Insert

When we want to add a new column to a dataframe, it is added at the end by default. However, pandas offers the option to add the new column in any position using the insert function.

We need to specify the position by passing an index as the first argument. This value must be an integer. Column indices start from zero just like row indices. The second argument is the column name and the third argument is the object that holds the values, which can be a Series or an array-like object.
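A sketch of insert() on an assumed frame. The rooms_per_household column is a made-up derived column used only to have something to insert.

```python
import pandas as pd

# Hypothetical frame; column names and values are assumed
df = pd.DataFrame({
    "total_rooms": [880, 7099, 1500],
    "households": [126, 1138, 177],
})

# Hypothetical derived column for illustration
rooms_per_household = df["total_rooms"] / df["households"]

# Arguments: position, column name, values. insert() modifies df in place.
df.insert(1, "rooms_per_household", rooms_per_household)

print(df.columns.tolist())
```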

Before you go

I hope this article will point you in the right direction when it comes to Python, Pandas, and Exploratory Data Analysis in general.

What are pandas used for?

Pandas is a Python library used for working with data sets. It has functions for analyzing, cleaning, exploring, and manipulating data. The name "Pandas" refers to both "Panel Data" and "Python Data Analysis"; the library was created by Wes McKinney in 2008.

What is the most common category pandas?

In pandas, the mode() method is used to find the mode, the most frequent value, of a column or row in a DataFrame. This method is also available on Series. To get unique values and their counts, use the unique(), value_counts(), and nunique() methods.
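A small sketch of these methods on an assumed Series of category labels:

```python
import pandas as pd

# Hypothetical category values
s = pd.Series(["INLAND", "NEAR BAY", "INLAND", "NEAR OCEAN", "INLAND"])

# mode() returns a Series because several values can tie for most frequent
most_common = s.mode().iloc[0]

counts = s.value_counts()   # counts per value, most frequent first
n_unique = s.nunique()      # number of distinct values

print(most_common)
print(counts)
```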

How to get top 10 values in pandas?

Use the pandas groupby() function to group the rows, then take the top N values of each group of the 'Variables' column. head() returns the first N rows of each group, so the data should be sorted first if the largest values are wanted. reset_index() is then used to provide a new index for the grouped result.
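A sketch of the top-N-per-group pattern on an assumed frame, with N set to 2. The grouping column name is an assumption and stands in for the 'Variables' column mentioned above.

```python
import pandas as pd

# Hypothetical frame; column names and values are assumed
df = pd.DataFrame({
    "ocean_proximity": ["INLAND", "INLAND", "INLAND",
                        "NEAR BAY", "NEAR BAY", "NEAR BAY"],
    "median_house_value": [100000, 300000, 200000, 500000, 400000, 600000],
})

# Sort descending first so that head() picks the largest rows of each group
top_per_group = (
    df.sort_values("median_house_value", ascending=False)
      .groupby("ocean_proximity")
      .head(2)
      .reset_index(drop=True)
)

# Top values overall, ignoring the groups
top_overall = df.nlargest(3, "median_house_value")

print(top_per_group)
print(top_overall)
```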

Which companies use pandas?

226 companies reportedly use Pandas in their tech stacks, including Instacart, Delivery Hero, and Tokopedia.

Instacart
Delivery Hero
Tokopedia
trivago
RatePAY GmbH
Avito
Data Scientist
Sendcloud