1
votes

Conditional frequency distribution using the Brown Corpus with NLTK in Python

I am trying to count the words ending in 'ing' or 'ed'. The task is: compute the conditional frequency distribution, where the conditions are ['government', 'hobbies'] and the event is either 'ing' or 'ed'. Store the conditional frequency distribution in the variable inged_cfd.

Here is my code:

from nltk.corpus import brown
import nltk

genre_word = [ (genre, word.lower())
              for genre in ['government', 'hobbies']
              for word in brown.words(categories = genre) if (word.endswith('ing') or word.endswith('ed')) ]

genre_word_list = [list(x) for x in genre_word]

for wd in genre_word_list:
    if wd[1].endswith('ing'):
      wd[1] = 'ing'
    elif wd[1].endswith('ed'):
      wd[1] = 'ed'

inged_cfd = nltk.ConditionalFreqDist(genre_word_list)

inged_cfd.tabulate(conditions = ['government', 'hobbies'], samples = ['ed','ing'])

I want the output in tabular format. Using the code above, I get this output:

            ed  ing 
government 2507 1605 
   hobbies 2561 2262

whereas the expected output is:

            ed  ing 
government 2507 1474 
   hobbies 2561 2169

Please help me solve this problem and get the exact expected result.

python-3.x nltk corpus

0 comments

6 Answers:

3
votes

You need to exclude stop words. Also, convert each word to lower case before checking its ending. Working code:

import nltk
from nltk.corpus import brown
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
cfdconditions = ['government', 'hobbies']  # the genres from the question

# Lower-case each word before testing its ending.
genre_word = [(genre, word.lower())
              for genre in cfdconditions
              for word in brown.words(categories=genre)
              if word.lower().endswith('ing') or word.lower().endswith('ed')]
genre_word_list = [list(x) for x in genre_word]

# Collapse each word to its suffix, leaving stop words untouched so they are not counted.
for wd in genre_word_list:
    if wd[1].endswith('ing') and wd[1] not in stop_words:
        wd[1] = 'ing'
    elif wd[1].endswith('ed') and wd[1] not in stop_words:
        wd[1] = 'ed'

inged_cfd = nltk.ConditionalFreqDist(genre_word_list)
inged_cfd.tabulate(conditions=cfdconditions, samples=['ed', 'ing'])
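
To see why both changes matter without downloading the corpus, here is a minimal sketch that uses only the standard library. The word lists and the stop-word set are hypothetical stand-ins for brown.words(...) and stopwords.words('english'), and suffix_counts is an illustrative helper, not part of NLTK.

```python
from collections import Counter, defaultdict

# Hypothetical stand-ins for brown.words(categories=...) and the NLTK stop-word list.
TOY_CORPUS = {
    "government": ["Being", "elected", "during", "Voting", "passed", "the"],
    "hobbies": ["Painting", "is", "having", "fun", "carved", "RED"],
}
TOY_STOP_WORDS = {"being", "during", "having", "the", "is"}


def suffix_counts(corpus, stop_words):
    """Count 'ed' and 'ing' endings per genre, ignoring case and stop words."""
    counts = defaultdict(Counter)
    for genre, words in corpus.items():
        for word in words:
            lowered = word.lower()
            if lowered in stop_words:
                continue  # e.g. 'being' and 'during' would inflate the 'ing' count
            if lowered.endswith("ing"):
                counts[genre]["ing"] += 1
            elif lowered.endswith("ed"):
                counts[genre]["ed"] += 1  # 'RED' only matches after lower-casing
    return counts


counts = suffix_counts(TOY_CORPUS, TOY_STOP_WORDS)
for genre in TOY_CORPUS:
    print(genre, dict(counts[genre]))
```

In this toy data, skipping the stop-word check would add 'Being' and 'during' to the 'ing' count for government, which is the same kind of overcount seen in the question.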

0 comments

0
votes

I used this solution, but I still cannot pass some of the tests. Two test cases still fail.

For the failed test cases, my output is:

                  good    bad better 
      adventure     39      9     30 
        fiction     60     17     27 
        mystery     45     13     29 
science_fiction     14      1      4 
                  ed  ing 
      adventure 3281 1844 
        fiction 2943 1767 
        mystery 2382 1374 
science_fiction  574  293

                 many years 
      adventure    24    32 
        fiction    29    44 
science_fiction    11    16 
                  ed  ing 
      adventure 3281 1844 
        fiction 2943 1767 
science_fiction  574  293

and my code is:

def calculateCFD(cfdconditions, cfdevents):
    # Write your code here
    from nltk.corpus import brown
    from nltk import ConditionalFreqDist
    from nltk.corpus import stopwords
    stopword = set(stopwords.words('english'))
    cdev_cfd = [ (genre, word.lower()) for genre in cfdconditions for word in brown.words(categories=genre) if word.lower() not in stopword]
    cdev_cfd = [list(x) for x in cdev_cfd]
    cdev_cfd = nltk.ConditionalFreqDist(cdev_cfd)
    a = cdev_cfd.tabulate(condition = cfdconditions, samples = cfdevents)
    inged_cfd = [ (genre, word.lower()) for genre in cfdconditions for word in brown.words(categories=genre) if (word.lower().endswith('ing') or word.lower().endswith('ed')) ]
    inged_cfd = [list(x) for x in inged_cfd]
    for wd in inged_cfd:
        if wd[1].endswith('ing') and wd[1] not in stopword:
            wd[1] = 'ing'
        elif wd[1].endswith('ed') and wd[1] not in stopword:
            wd[1] = 'ed'

    inged_cfd = nltk.ConditionalFreqDist(inged_cfd)    
    b = inged_cfd.tabulate(conditions = sorted(cfdconditions), samples = ['ed','ing'])
    return(a,b)

If anyone can offer a solution, it would be a great help. Thanks.

0 comments

3
votes

Using the same variable cfdconditions in both places causes the problem. In Python everything is passed as an object reference, so after you use cfdconditions the first time, it may be modified when you pass it to cdev_cfd.tabulate, and the next call then receives the modified list. It is better to make a separate copy of the list and pass that copy to the second call.

Here is my modified version:

import nltk
from nltk.corpus import brown
from nltk.corpus import stopwords

def calculateCFD(cfdconditions, cfdevents):
    stop_words = stopwords.words('english')
    at = [i for i in cfdconditions]  # independent copy for the second tabulate call
    # Keep alphabetic tokens that are not stop words (this check is case-sensitive).
    nt = [(genre, word.lower())
          for genre in cfdconditions
          for word in brown.words(categories=genre) if word not in stop_words and word.isalpha()]

    cdv_cfd = nltk.ConditionalFreqDist(nt)
    cdv_cfd.tabulate(conditions=cfdconditions, samples=cfdevents)
    nt1 = [(genre, word.lower())
           for genre in cfdconditions
           for word in brown.words(categories=genre)]

    # Map each word to its suffix event; only 'ing' words are checked against stop words.
    temp = []
    for we in nt1:
        wd = we[1]
        if wd[-3:] == 'ing' and wd not in stop_words:
            temp.append((we[0], 'ing'))

        if wd[-2:] == 'ed':
            temp.append((we[0], 'ed'))

    inged_cfd = nltk.ConditionalFreqDist(temp)
    a = ['ed', 'ing']
    inged_cfd.tabulate(conditions=at, samples=a)

Hope this helps!
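
The general Python behaviour behind the defensive copy can be shown with a small standard-library sketch. It does not claim that tabulate itself changes its arguments; sort_in_place is a hypothetical function that stands in for any callee that mutates the list it receives.

```python
def sort_in_place(items):
    """Hypothetical callee that mutates the list it is given."""
    items.sort()


conditions = ["fiction", "adventure", "science_fiction"]
snapshot = list(conditions)  # shallow copy, same idea as [i for i in cfdconditions]

sort_in_place(conditions)

# The caller's list was changed through the shared reference,
# while the copy still holds the original order.
print(conditions)
print(snapshot)
```

Note that sorted(conditions) would have returned a new list and left the original untouched, whereas list.sort() reorders the existing object.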

0 comments

0
votes

The expected output is:

                  good    bad better 
      adventure     39      9     30 
        fiction     60     17     27 
science_fiction     14      1      4 
        mystery     45     13     29 
                  ed  ing 
      adventure 3281 1844 
        fiction 2943 1767 
science_fiction  574  293 
        mystery 2382 1374

                 many years 
        fiction    29    44 
      adventure    24    32 
science_fiction    11    16 
                  ed  ing 
        fiction 2943 1767 
      adventure 3281 1844 
science_fiction  574  293
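
In these tables the rows appear in the order in which the conditions were supplied, not in alphabetical order. The sketch below is a rough standard-library illustration of that difference. format_table is a hypothetical helper written for this example, not NLTK's tabulate, and the counts are copied from the second expected table above.

```python
from collections import Counter


def format_table(counts, conditions, samples):
    """Return table lines with one row per condition, in the order given."""
    label_width = max(len(c) for c in conditions)
    cell_width = max(
        max(len(s) for s in samples),
        max(len(str(counts[c][s])) for c in conditions for s in samples),
    )
    header = " " * label_width + " " + " ".join(s.rjust(cell_width) for s in samples)
    rows = [
        c.rjust(label_width) + " "
        + " ".join(str(counts[c][s]).rjust(cell_width) for s in samples)
        for c in conditions
    ]
    return [header] + rows


counts = {
    "fiction": Counter(ed=2943, ing=1767),
    "adventure": Counter(ed=3281, ing=1844),
    "science_fiction": Counter(ed=574, ing=293),
}
requested = ["fiction", "adventure", "science_fiction"]

as_requested = format_table(counts, requested, ["ed", "ing"])
alphabetical = format_table(counts, sorted(requested), ["ed", "ing"])

print("\n".join(as_requested))
print("\n".join(alphabetical))
```

Passing sorted(requested) puts adventure first, so a checker that compares output line by line against the expected table would reject it even though every count is correct.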

2 comments

This does not show how to get the correct result.

Ishan Kankane shared the code above and it works perfectly. The differences I noticed are: 1) it uses isalpha() (although the question does not mention it), so try adding that too; 2) when building the list of 'ing' and 'ed' events, it uses a list of tuples, whereas the other code uses a list of lists, so try converting that as well; 3) when building the (genre, word) pairs, the if condition tests "word not in stop_words" rather than "word.lower() not in stop_words", so try that too.
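
As a rough illustration of points 1 and 3, the sketch below filters a hypothetical word list both ways. STOP_WORDS is a tiny stand-in for the NLTK stop-word list, which contains lower-case entries only.

```python
# Hypothetical stand-ins; the real lists come from NLTK.
STOP_WORDS = {"the", "being"}
words = ["The", "being", "Being", "1960", "walked", "well-being"]

# Case-sensitive check: capitalised stop words slip through.
case_sensitive = [w for w in words if w not in STOP_WORDS and w.isalpha()]

# Case-insensitive check: every spelling of a stop word is dropped.
case_insensitive = [w for w in words if w.lower() not in STOP_WORDS and w.isalpha()]

print(case_sensitive)
print(case_insensitive)
```

In both versions isalpha() removes '1960' and 'well-being', since digits and hyphens are not alphabetic. The two filters differ only on 'The' and 'Being', which is why the choice can change the counts a test suite expects.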

0
votes

I used this approach; it takes fewer lines of code and is faster.

import nltk
from nltk.corpus import brown
from nltk.corpus import stopwords

def calculateCFD(cfdconditions, cfdevents):
    stop_words = set(stopwords.words('english'))
    cdev_cfd = nltk.ConditionalFreqDist([(genre, word.lower()) for genre in cfdconditions
          for word in brown.words(categories=genre) if word.lower() not in stop_words])

    # Map each non-stop word to its suffix: the last three letters for 'ing', otherwise the last two.
    inged_cfd = nltk.ConditionalFreqDist([(genre, word[-3:].lower() if word.lower().endswith('ing') else word[-2:].lower())
                                          for genre in cfdconditions for word in brown.words(categories=genre)
                                          if word.lower() not in stop_words and (word.lower().endswith('ing') or word.lower().endswith('ed'))])

    cdev_cfd.tabulate(conditions=cfdconditions, samples=cfdevents)

    inged_cfd.tabulate(conditions=cfdconditions, samples=['ed','ing'])

0 comments

-1
votes

import nltk
from nltk.corpus import stopwords, brown

def calculateCFD(cfdconditions, cfdevents):
    # Write your code here
    stop_words = set(stopwords.words("english"))
    # First table: counts of the requested events among non-stop words.
    list1 = [(genre, word.lower()) for genre in cfdconditions for word in brown.words(categories=genre) if word.lower() not in stop_words]
    cfd1 = nltk.ConditionalFreqDist(list1)
    # tabulate() prints the table and returns None.
    cfd1_tabulate = cfd1.tabulate(conditions=cfdconditions, samples=cfdevents)
    #print(cfd1_tabulate)

    # Second table: non-stop words ending in 'ed' or 'ing', collapsed to their suffix.
    list2 = [[genre, word.lower()] for genre in cfdconditions for word in brown.words(categories=genre) if word.lower() not in stop_words if (word.lower().endswith("ed") or word.lower().endswith("ing"))]
    for elem in list2:
        if elem[1].endswith("ed"):
            elem[1] = "ed"
        else:
            elem[1] = "ing"

    cfd2 = nltk.ConditionalFreqDist(list2)
    cfd2_tabulate = cfd2.tabulate(conditions=cfdconditions, samples=["ed", "ing"])
    #print(cfd2_tabulate)

    return cfd1_tabulate, cfd2_tabulate

1 comment

Hi, welcome to the SO community! You are always encouraged to add some text explaining what your code does, rather than pasting the code on its own.