There are situations where we need to extract specific portions of a text field or column. We can do this with a combination of the apply, lambda, and split functions. For this blog, we will use a LOT (line of therapy) example in which a given cell holds a compressed record of therapy progression, and we want to extract one specific line of therapy from it (here, the second).
Step 1: Importing Libraries
In [3]:
import pandas as pd
import numpy as np
import re
In [4]:
# Let's look at a small example
# Here each line of treatment is separated by |
# So 'LOT 1' is the first line of therapy
# and 'LOT2' is the second line of therapy
txt = 'LOT 1 | LOT2'
txt.split("|")
Out[4]:
['LOT 1 ', ' LOT2']
In [13]:
# We see that we can use the split function to break up the text
# and then use positional indexing to extract the required LOT
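The split-then-index idea described above can be sketched as follows. This is a minimal illustration that reuses the `txt` string from the earlier cell; the names `parts`, `first_lot`, and `second_lot` are introduced here for the example only. Note that split keeps the spaces around the separator, so each piece is stripped before use.

```python
txt = 'LOT 1 | LOT2'

# Split on the separator, then strip the leftover whitespace from each piece
parts = [part.strip() for part in txt.split("|")]

# Python lists are 0-indexed, so the first line of therapy sits at position 0
first_lot = parts[0]
second_lot = parts[1]

print(first_lot)
print(second_lot)
```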
Step 2: Creating the sample dataset
In [5]:
Line_of_Therapy = ['Drug 1 | Drug 2','Drug 1 + Drug 3 | Drug 4 + Drug 5 | Drug 3']
Line_of_Therapy
Out[5]:
['Drug 1 | Drug 2', 'Drug 1 + Drug 3 | Drug 4 + Drug 5 | Drug 3']
In [6]:
Patient = [1,2]
Patient
Out[6]:
[1, 2]
In [7]:
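# Build the frame from a dict so that Patient keeps its integer dtype
# (going through np.array would convert every value to a string)
df = pd.DataFrame({'Patient': Patient, 'Line of Therapy': Line_of_Therapy})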
df
Out[7]:
 | Patient | Line of Therapy |
---|---|---|
0 | 1 | Drug 1 \| Drug 2 |
1 | 2 | Drug 1 + Drug 3 \| Drug 4 + Drug 5 \| Drug 3 |
Step 3: Extracting the second line of therapy
Using apply and lambda to extract the required component
In [8]:
df['LOT2']=df['Line of Therapy'].apply(lambda x: x.split("|")[1].strip())
df
Out[8]:
 | Patient | Line of Therapy | LOT2 |
---|---|---|---|
0 | 1 | Drug 1 \| Drug 2 | Drug 2 |
1 | 2 | Drug 1 + Drug 3 \| Drug 4 + Drug 5 \| Drug 3 | Drug 4 + Drug 5 |
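One caveat with the lambda above: indexing with `[1]` raises an IndexError for any patient whose cell contains only a single line of therapy. A minimal sketch of a more defensive version is shown below. The helper name `extract_lot` and its parameters `n` and `sep` are hypothetical, introduced only for this illustration, and returning None for a missing line is an assumption about how such gaps should be represented.

```python
def extract_lot(progression, n, sep="|"):
    """Return the n-th (1-based) line of therapy, or None if it is absent."""
    # Split the compressed progression and clean up each piece
    parts = [part.strip() for part in progression.split(sep)]
    # Guard against patients who never reached the requested line
    if 1 <= n <= len(parts) and parts[n - 1]:
        return parts[n - 1]
    return None

print(extract_lot('Drug 1 + Drug 3 | Drug 4 + Drug 5 | Drug 3', 2))
print(extract_lot('Drug 1 | Drug 2', 3))
```

It could then be applied to the column in the same way as before, for example `df['Line of Therapy'].apply(lambda x: extract_lot(x, 2))`.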