Python coding exercise for Pandas Part 1

Exercise 1: Create a series from a list, NumPy array and dict

In [1]:

import pandas as pd
import numpy as np

In [4]:

# List
l1=[1,2,3,4,5]
l1

Out[4]:

[1, 2, 3, 4, 5]

In [5]:

# Array
arr = np.array(['a','b','c','d','e'])
arr

Out[5]:

array(['a', 'b', 'c', 'd', 'e'], dtype='<U1')

In [13]:

# Dictionary
d1=dict(zip(l1,arr))
d1

Out[13]:

{1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e'}

In [11]:

# Series from list
s1=pd.Series(l1)
s1

Out[11]:

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [12]:

# Series from array
s2=pd.Series(arr)
s2

Out[12]:

0    a
1    b
2    c
3    d
4    e
dtype: object

In [14]:

# Series from dictionary
s3=pd.Series(d1)
s3

Out[14]:

1    a
2    b
3    c
4    d
5    e
dtype: object

Exercise 2: Convert the index of a series into a column of a dataframe

In [20]:

# reset_index() on a series already returns a dataframe
df=s3.reset_index()
df

Out[20]:

	index	0
0	1	a
1	2	b
2	3	c
3	4	d
4	5	e

In [21]:

# Renaming the column with 0 as the header
df.columns=['index','Col1']
df

Out[21]:

	index	Col1
0	1	a
1	2	b
2	3	c
3	4	d
4	5	e
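The two steps above can also be done in one chain. The following is a minimal sketch, assuming the goal is a two-column dataframe with readable headers: `rename_axis` names the index and `reset_index(name=...)` names the value column. The series `s` and the labels `key` and `Col1` are illustrative.

```python
import pandas as pd

# Illustrative series with the same shape as the dict-based series above
s = pd.Series({1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e'})

# Name the index, then move it into a column; `name` labels the value column
df = s.rename_axis('key').reset_index(name='Col1')
print(df)
```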

Exercise 3: Combine many series to form a dataframe

In [26]:

s1=pd.Series(['a','b','c','d','e'])
s2=pd.Series([1,2,3,4,5])
s3=pd.Series(np.random.rand(5))

In [29]:

df=pd.DataFrame([s1,s2,s3]).T
df

Out[29]:

	0	1	2
0	a	1	0.960486
1	b	2	0.07414
2	c	3	0.586193
3	d	4	0.54194
4	e	5	0.603479

In [30]:

df.columns=['Name','ID','Salary']
df

Out[30]:

	Name	ID	Salary
0	a	1	0.960486
1	b	2	0.07414
2	c	3	0.586193
3	d	4	0.54194
4	e	5	0.603479
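Stacking the series as rows and then transposing leaves every column with dtype object, because each original row mixed strings and numbers. A sketch of one alternative is `pd.concat` along `axis=1`, which keeps each series' own dtype. The seed and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
names = pd.Series(['a', 'b', 'c', 'd', 'e'])
ids = pd.Series([1, 2, 3, 4, 5])
salaries = pd.Series(rng.random(5))

# Join the series side by side; `keys` supplies the column names
df = pd.concat([names, ids, salaries], axis=1, keys=['Name', 'ID', 'Salary'])
print(df.dtypes)
```

With this layout, `ID` stays an integer column and `Salary` stays a float column, so numeric operations such as `df['Salary'].mean()` work without conversion.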

Exercise 4: Assign a name to the series' index

In [31]:

s1=pd.Series([1,2,3])
s1

Out[31]:

0    1
1    2
2    3
dtype: int64

In [42]:

s1.index.names=['Index_Name']
s1

Out[42]:

Index_Name
0    1
1    2
2    3
dtype: int64
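The index name is separate from the series' own name. A small sketch of both, using `rename_axis` as a non-mutating alternative to assigning `index.names`; the labels are illustrative.

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# rename_axis returns a new series with the index named; `s` is unchanged
named = s.rename_axis('Index_Name')

# The series' own name is a different attribute from the index name
named.name = 'Values'

print(named.index.name)
print(named.name)
```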

Exercise 5: Compare items between two series

In [43]:

A = pd.Series([1,2,3,4])
A

Out[43]:

0    1
1    2
2    3
3    4
dtype: int64

In [130]:

B = pd.Series([3,4,5,6,7])
B

Out[130]:

0    3
1    4
2    5
3    6
4    7
dtype: int64

In [104]:

# Converting series into list
A_list=list(A)
A_list

Out[104]:

[1, 2, 3, 4]

In [131]:

B_list=list(B)
B_list

Out[131]:

[3, 4, 5, 6, 7]

Common elements

In [132]:

common_elements=set(A_list) & set(B_list)
common_elements_ls=list(common_elements)
common_elements_ls

Out[132]:

[3, 4]

Through list comprehension

In [133]:

[i for i in A_list for j in B_list if i==j]

Out[133]:

[3, 4]

Elements in A not present in B

In [134]:

# Elements that appear only in A
only_in_A=list(set(A_list)-common_elements)
only_in_A

Out[134]:

[1, 2]

Elements in B not present in A

In [135]:

# Elements that appear only in B
only_in_B=list(set(B_list)-common_elements)
only_in_B

Out[135]:

[5, 6, 7]

Using the series approach

Common elements

In [138]:

A[A.isin(B)]

Out[138]:

2    3
3    4
dtype: int64

Elements in A not present in B

In [139]:

A[~A.isin(B)]

Out[139]:

0    1
1    2
dtype: int64

Elements in B not present in A

In [141]:

B[~B.isin(A)]

Out[141]:

2    5
3    6
4    7
dtype: int64
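A third option, sketched here as an alternative rather than a replacement, is NumPy's set routines. They return sorted arrays of unique values, so the original index labels are dropped.

```python
import numpy as np
import pandas as pd

A = pd.Series([1, 2, 3, 4])
B = pd.Series([3, 4, 5, 6, 7])

common = np.intersect1d(A, B)   # values present in both
only_in_A = np.setdiff1d(A, B)  # values in A that are not in B
only_in_B = np.setdiff1d(B, A)  # values in B that are not in A

print(common, only_in_A, only_in_B)
```

The `isin` approach is the better fit when the index positions matter; the NumPy routines are convenient when only the values are needed.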

Exercise 6: Get the minimum, 25th percentile, median, 75th percentile, and maximum of a numeric series

In [142]:

# Creating a random series
s1=pd.Series(np.random.rand(10))
s1

Out[142]:

0    0.484274
1    0.990742
2    0.580644
3    0.161801
4    0.816207
5    0.640640
6    0.494005
7    0.562894
8    0.339194
9    0.988645
dtype: float64

In [143]:

s1.min()

Out[143]:

0.1618008141423638

In [144]:

s1.max()

Out[144]:

0.9907420622435107

In [146]:

s1.median()

Out[146]:

0.5717688611141396

In [147]:

s1.quantile(0.75)

Out[147]:

0.7723152702700348
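The 25th percentile is not computed in the cells above. `quantile` accepts a list, so all five values can come from one call. The data in this sketch are the ten values printed in Out[142], rounded to six decimals, so the results agree with the outputs above only to about six decimal places.

```python
import pandas as pd

# The values printed above, rounded to six decimals
s = pd.Series([0.484274, 0.990742, 0.580644, 0.161801, 0.816207,
               0.640640, 0.494005, 0.562894, 0.339194, 0.988645])

# 0 and 1 give the minimum and maximum; the rest are the quartiles
five_num = s.quantile([0, 0.25, 0.5, 0.75, 1])
print(five_num)
```

`s.describe()` reports the same five values together with the count, mean and standard deviation.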

Exercise 7: Frequency counts of unique items of a series

In [148]:

s1=pd.Series([1,2,3,3,3,4,4,4,4,5])
s1

Out[148]:

0    1
1    2
2    3
3    3
4    3
5    4
6    4
7    4
8    4
9    5
dtype: int64

In [151]:

s1.value_counts()

Out[151]:

4    4
3    3
1    1
2    1
5    1
dtype: int64
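Two variations on `value_counts` that are often useful, shown as a short sketch on the same data: relative frequencies through `normalize=True`, and counts ordered by value through `sort_index()`.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 3, 3, 4, 4, 4, 4, 5])

# Relative frequencies instead of raw counts
shares = s.value_counts(normalize=True)

# Counts ordered by value rather than by frequency
by_value = s.value_counts().sort_index()

print(shares)
print(by_value)
```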

Exercise 8: Keep only the top 2 most frequent values as they are and replace everything else with ‘Others’

In [158]:

s1=pd.Series([1,2,3,3,3,4,4,4,4,5])
s1

Out[158]:

0    1
1    2
2    3
3    3
4    3
5    4
6    4
7    4
8    4
9    5
dtype: int64

In [159]:

# Getting the count of values
s2=s1.value_counts()
s2

Out[159]:

4    4
3    3
1    1
2    1
5    1
dtype: int64

In [156]:

# Getting the top two most frequent values
# (the index of value_counts() holds the values; the data holds their counts)
top_2=list(s2.index[:2])
top_2

Out[156]:

[4, 3]

In [162]:

# Using numpy where function to create conditional column
arr1=np.where(s1.isin(top_2),s1,"Others")
arr1

Out[162]:

array(['Others', 'Others', '3', '3', '3', '4', '4', '4', '4', 'Others'],
      dtype='<U21')

In [163]:

s1_recoded=pd.Series(arr1)
s1_recoded

Out[163]:

0    Others
1    Others
2         3
3         3
4         3
5         4
6         4
7         4
8         4
9    Others
dtype: object
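`np.where` returns a single string array here, so the kept values 3 and 4 become the strings '3' and '4'. A sketch of one way to keep them as numbers, assuming an object-dtype result is acceptable: cast the series to object first, then use `Series.where`, which keeps entries where the condition holds and replaces the rest.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 3, 3, 4, 4, 4, 4, 5])

# The index of value_counts() holds the values, sorted by descending count
top_2 = s.value_counts().index[:2]

# Keep entries where the condition holds; replace the rest with "Others"
recoded = s.astype(object).where(s.isin(top_2), "Others")
print(recoded.tolist())
```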

Exercise 9: Bin a numeric series into 10 groups of equal size

In [164]:

s1=pd.Series(np.random.rand(20))
s1

Out[164]:

0     0.614530
1     0.020887
2     0.264154
3     0.342449
4     0.117960
5     0.081493
6     0.864229
7     0.642582
8     0.241652
9     0.656198
10    0.889487
11    0.545045
12    0.334097
13    0.820589
14    0.100319
15    0.890976
16    0.907038
17    0.258617
18    0.086489
19    0.289996
dtype: float64

In [168]:

gp_values=list(s1.quantile([0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]))
gp_values

Out[168]:

[0.08598908915755857,
 0.11443163727729971,
 0.25352740749430075,
 0.27965892220551697,
 0.3382729724163345,
 0.5728387110301085,
 0.646666712157136,
 0.8293170005262019,
 0.8896355153009737]

In [186]:

s2=np.where(s1 <= gp_values[0],"G1",
         np.where((s1 > gp_values[0]) & (s1 <= gp_values[1]),"G2",
         np.where((s1 > gp_values[1]) & (s1 <= gp_values[2]),"G3", 
         np.where((s1 > gp_values[2]) & (s1 <= gp_values[3]),"G4",
         np.where((s1 > gp_values[3]) & (s1 <= gp_values[4]),"G5",
         np.where((s1 > gp_values[4]) & (s1 <= gp_values[5]),"G6",
         np.where((s1 > gp_values[5]) & (s1 <= gp_values[6]),"G7",
         np.where((s1 > gp_values[6]) & (s1 <= gp_values[7]),"G8",
         np.where((s1 > gp_values[7]) & (s1 <= gp_values[8]),"G9",
         "G10")))))))))
s2

Out[186]:

array(['G7', 'G1', 'G4', 'G6', 'G3', 'G1', 'G9', 'G7', 'G3', 'G8', 'G9',
       'G6', 'G5', 'G8', 'G2', 'G10', 'G10', 'G4', 'G2', 'G5'],
      dtype='<U3')

In [190]:

df=pd.DataFrame([s1,s2]).T
df.columns=['Values','Groups']
df

Out[190]:

	Values	Groups
0	0.6145298225899369	G7
1	0.020886734419647057	G1
2	0.2641535431059717	G4
3	0.3424493159747698	G6
4	0.11795974066079085	G3
5	0.0814925546914802	G1
6	0.8642288997565742	G9
7	0.6425819946348839	G7
8	0.2416520629589226	G3
9	0.6561977197090577	G8
10	0.8894865607576895	G9
11	0.5450446366568895	G6
12	0.33409662885789926	G5
13	0.8205890257186088	G8
14	0.10031922374333513	G2
15	0.8909761061905312	G10
16	0.9070383169324525	G10
17	0.25861684086660564	G4
18	0.08648870409823395	G2
19	0.28999584160521374	G5
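The nested `np.where` calls above reproduce by hand what `pd.qcut` does directly: it cuts a series into quantile-based bins with roughly the same number of items in each. A minimal sketch, assuming freshly generated data (the seed is illustrative, so the values differ from the table above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # seeded only so the sketch is repeatable
s = pd.Series(rng.random(20))

# Ten quantile-based bins, labelled G1 (lowest values) to G10 (highest)
labels = [f"G{i}" for i in range(1, 11)]
groups = pd.qcut(s, q=10, labels=labels)

df = pd.DataFrame({'Values': s, 'Groups': groups})
print(df)
```

Because `Values` and `Groups` are passed as separate columns, `Values` keeps its float dtype and `Groups` becomes an ordered categorical, which sorts G1 to G10 in the intended order.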

Machine Learning Made Easy

Tuesday, February 1, 2022
