5
votes

Searching multiple strings for multiple words

I have a dataframe containing one sentence per row. I need to search these sentences for occurrences of certain words. This is how I currently do it:

import pandas as pd

p = pd.DataFrame({"sentence" : ["this is a test", "yet another test", "now two tests", "test a", "no test"]})

test_words = ["yet", "test"]
p["word_test"] = ""
p["word_yet"]  = ""

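# Row by row: store the position of each word in the sentence (-1 if absent)
for i in range(len(p)):
    for word in test_words:
        # Single .loc call with (row, column) avoids chained assignment
        p.loc[i, "word_" + word] = p.loc[i, "sentence"].find(word)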

This works as expected, but is it possible to optimize it? It runs quite slowly on large dataframes.

python pandas dataframe


3 Answers:

5
votes

IIUC, use a simple dict comprehension and call str.find for each word:

# df is the DataFrame from the question (named p there)
u = pd.DataFrame({
    # or 'word_{}'.format(w) on Python < 3.6
    f'word_{w}': df.sentence.str.find(w) for w in test_words}, index=df.index)
u
   word_yet  word_test
0        -1         10
1         0         12
2        -1          8
3        -1          0
4        -1          3

To attach the result to the original DataFrame:

pd.concat([df, u], axis=1)

           sentence  word_yet  word_test
0    this is a test        -1         10
1  yet another test         0         12
2     now two tests        -1          8
3            test a        -1          0
4           no test        -1          3
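
For reference, the same idea can be written as a self-contained sketch. The helper name add_find_columns and its column parameter are hypothetical, introduced only for illustration, and it uses DataFrame.assign instead of pd.concat, which should give an equivalent result:

```python
import pandas as pd


def add_find_columns(frame, words, column="sentence"):
    """Return a copy of `frame` with one 'word_<w>' column per word.

    Hypothetical helper: each new column holds the index of the first
    occurrence of the word in `column`, or -1 when the word is absent.
    """
    positions = {f"word_{w}": frame[column].str.find(w) for w in words}
    return frame.assign(**positions)


p = pd.DataFrame({"sentence": ["this is a test", "yet another test",
                               "now two tests", "test a", "no test"]})
result = add_find_columns(p, ["yet", "test"])
print(result)
```

Because assign returns a new DataFrame, the original p is left unchanged.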


5
votes

You can use str.find:

p['word_test'] = p.sentence.str.find('test')
p['word_yet'] = p.sentence.str.find('yet')

           sentence  word_test  word_yet
0    this is a test         10        -1
1  yet another test         12         0
2     now two tests          8        -1
3            test a          0        -1
4           no test          3        -1


5
votes

Since you mentioned wanting better performance, you could use np.char.find:

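import numpy as np

# One row of positions per word; transpose below so each word becomes a column
df = pd.DataFrame(data=[np.char.find(p.sentence.values.astype(str), x) for x in test_words], index=test_words, columns=p.index)
pd.concat([p, df.T], axis=1)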
Out[32]: 
           sentence  yet  test
0    this is a test   -1    10
1  yet another test    0    12
2     now two tests   -1     8
3            test a   -1     0
4           no test   -1     3
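
Whether np.char.find is actually faster than Series.str.find depends on the data and the library versions, so it is worth measuring. The following is only a rough sketch under stated assumptions: the 10,000-row frame is synthetic (the question's five sentences repeated), and the function names with_str_find and with_np_char_find are hypothetical:

```python
import timeit

import numpy as np
import pandas as pd

sentences = ["this is a test", "yet another test", "now two tests",
             "test a", "no test"]
# Synthetic data: the five sample sentences repeated 2,000 times
big = pd.DataFrame({"sentence": sentences * 2_000})
test_words = ["yet", "test"]


def with_str_find(frame):
    # pandas string accessor, one vectorised call per word
    return pd.DataFrame({w: frame["sentence"].str.find(w) for w in test_words},
                        index=frame.index)


def with_np_char_find(frame):
    # Convert once to a fixed-width NumPy string array, then search it per word
    values = frame["sentence"].to_numpy().astype(str)
    return pd.DataFrame({w: np.char.find(values, w) for w in test_words},
                        index=frame.index)


# Both approaches should report identical positions
same = np.array_equal(with_str_find(big).to_numpy(),
                      with_np_char_find(big).to_numpy())
print("identical results:", same)

for fn in (with_str_find, with_np_char_find):
    seconds = timeit.timeit(lambda: fn(big), number=3)
    print(f"{fn.__name__}: {seconds:.4f} s for 3 runs")
```

The timings are deliberately left unquoted here, since they vary by machine; run the sketch on your own data before choosing one approach.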
