There are situations where we need to extract a specific portion of a text field or column. We can do this with a combination of the apply and split functions. For this blog, we will use a LOT (line of therapy) example, where a given cell holds the compressed therapy progression of a patient and we want to extract a single line of therapy from it (here, the second one).
Step 1: Importing Libraries
In [1]:
import pandas as pd
import numpy as np
import re
In [12]:
# Let's look at a small example
# Here each line of treatment is separated by |
# So LOT 1 is the first line of therapy
# and LOT2 is the second line of therapy
txt = 'LOT 1 | LOT2'
txt.split("|")
Out[12]:
['LOT 1 ', ' LOT2']
In [13]:
# We see that we can use the split function to break up the text
# and then use positional indexing to extract the required LOT
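To make the positional step concrete, here is a minimal sketch that reuses the sample string from the cell above. The names `parts`, `first_lot` and `second_lot` are ours and are used only for illustration; the `strip()` call is there because the split leaves stray spaces around each piece.

```python
# Same sample string as in the cell above
txt = 'LOT 1 | LOT2'

# Split on the separator and strip the stray spaces around each piece
parts = [part.strip() for part in txt.split("|")]

first_lot = parts[0]   # index 0 holds the first line of therapy
second_lot = parts[1]  # index 1 holds the second line of therapy
print(first_lot, second_lot, sep=", ")
```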
Step 2: Creating the sample dataset
In [20]:
Line_of_Therapy = ['Drug 1 | Drug 2','Drug 1 + Drug 3 | Drug 4 + Drug 5 | Drug 3']
Line_of_Therapy
Out[20]:
['Drug 1 | Drug 2', 'Drug 1 + Drug 3 | Drug 4 + Drug 5 | Drug 3']
In [21]:
Patient = [1,2]
Patient
Out[21]:
[1, 2]
In [22]:
# Build the DataFrame from a dict so that Patient stays numeric
df = pd.DataFrame({'Patient': Patient, 'Line of Therapy': Line_of_Therapy})
df
Out[22]:
| | Patient | Line of Therapy |
|---|---|---|
| 0 | 1 | Drug 1 \| Drug 2 |
| 1 | 2 | Drug 1 + Drug 3 \| Drug 4 + Drug 5 \| Drug 3 |
Step 3: Extracting the second line of therapy
Creating the function to split and extract the required component
In [49]:
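def str_split_txt(x):
    # Split on the "|" separator and take the second piece (index 1)
    y = x.split("|")[1]
    # Remove the spaces left around the piece
    z = y.strip()
    return z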
In [50]:
# Testing the function
str_split_txt('Drug 1 | Drug 2')
# So the function is working
Out[50]:
'Drug 2'
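One thing to keep in mind: the function above always asks for index 1, so it raises an IndexError for a patient who has only one line of therapy. A more general sketch could look like the following. The helper `extract_lot` and its `position` and `sep` parameters are hypothetical names introduced here for illustration, and returning `None` for a missing line is our assumption about what is convenient, not part of the original function.

```python
def extract_lot(text, position, sep="|"):
    """Return the line of therapy at a 1-based position, or None if it is absent."""
    # Split on the separator and strip the spaces around each piece
    parts = [part.strip() for part in text.split(sep)]
    # Only index into the list when the requested line actually exists
    if 1 <= position <= len(parts):
        return parts[position - 1]
    return None


print(extract_lot('Drug 1 | Drug 2', 2))  # second line exists
print(extract_lot('Drug 1 | Drug 2', 3))  # third line does not exist
```

Because the position is an argument, the same helper can be reused for LOT1, LOT3 and so on, for example with `df['Line of Therapy'].apply(extract_lot, position=1)`.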
In [51]:
df['LOT2']=df['Line of Therapy'].apply(str_split_txt)
df
Out[51]:
| | Patient | Line of Therapy | LOT2 |
|---|---|---|---|
| 0 | 1 | Drug 1 \| Drug 2 | Drug 2 |
| 1 | 2 | Drug 1 + Drug 3 \| Drug 4 + Drug 5 \| Drug 3 | Drug 4 + Drug 5 |
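As an alternative to apply, pandas also offers string methods on the column itself through the `.str` accessor. The sketch below rebuilds the sample data so that it runs on its own, and adds a third patient with a single line of therapy. That third row is hypothetical and is there only to show what happens when the second line is missing.

```python
import pandas as pd

df = pd.DataFrame({
    'Patient': [1, 2, 3],
    'Line of Therapy': [
        'Drug 1 | Drug 2',
        'Drug 1 + Drug 3 | Drug 4 + Drug 5 | Drug 3',
        'Drug 6',  # hypothetical patient with only one line of therapy
    ],
})

# Split every cell into a list of pieces
lots = df['Line of Therapy'].str.split("|")

# Take the piece at index 1 from each list, then strip the surrounding spaces;
# rows without a second piece end up as missing values instead of raising an error
df['LOT2'] = lots.str[1].str.strip()
print(df)
```

This gives the same LOT2 values as the apply approach for the first two patients, and leaves a missing value for the third.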