1
votes

Count the number of rows containing words

I have a dataset with many rows that contain descriptions of fruit, for example:

An apple hangs on an apple tree
Bananas are yellow and tasty
The apple is tasty

I need to find the unique words in these descriptions (I have already done that), and then count how many rows each unique word appears in. Example:

Apple 2 (rows)
Bananas 1 (rows)
tree 1 (rows)
tasty 2 (rows)

I did something like this:

rows <- data_frame %>%
  filter(str_detect(variable, "apple"))
count_rows <- as.data.frame(nrow(rows))

The problem is that I have too many words, so I don't want to do this manually for each one. Any ideas?

r text-mining

3 comments

Do you have a list of the words you want counts for?

Yes, I have a list.

It works! Thank you all for your help :D

3 Answers:

1
votes

Sample data:

df <- data.frame(sentences = c("An apple hangs on an apple tree",
                               "Bananas are yellow and tasty",
                               "The apple is tasty"),
                 stringsAsFactors = FALSE)

list_of_words <- tolower(c("Apple", "Bananas", "tree", "tasty"))

A dplyr and tidyr option could be:

library(dplyr)
library(tidyr)   # unnest()
library(tibble)  # rowid_to_column()

df %>%
 rowid_to_column() %>%                                          # tag each sentence with its row number
 mutate(sentences = strsplit(sentences, " ", fixed = TRUE)) %>% # split each sentence into words
 unnest(sentences) %>%                                          # one word per row
 mutate(sentences = tolower(sentences)) %>%
 filter(sentences %in% list_of_words) %>%
 group_by(sentences) %>%
 summarise_all(n_distinct)                                      # distinct row ids per word

  sentences rowid
  <chr>     <int>
1 apple         2
2 bananas       1
3 tasty         2
4 tree          1
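As a variation on the pipeline above, the same count can be sketched with `distinct()` and `count()` instead of `summarise_all()`, which recent dplyr releases mark as superseded. This is only an illustration under stated assumptions: the column names `word` and `n_rows` are made up for the example, and words are split on single spaces exactly as in the answer.

```r
library(dplyr)
library(tidyr)

df <- data.frame(sentences = c("An apple hangs on an apple tree",
                               "Bananas are yellow and tasty",
                               "The apple is tasty"),
                 stringsAsFactors = FALSE)

list_of_words <- c("apple", "bananas", "tree", "tasty")

counts <- df %>%
  mutate(rowid = row_number()) %>%                                    # remember which row each word came from
  mutate(word = strsplit(tolower(sentences), " ", fixed = TRUE)) %>%  # list column of lower-cased words
  unnest(word) %>%                                                    # one word per row
  filter(word %in% list_of_words) %>%
  distinct(rowid, word) %>%                                           # a word counts once per row
  count(word, name = "n_rows")                                        # hypothetical column name

counts
```

The `distinct()` step is what keeps "apple" from being counted twice for the first sentence.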

1 comment

The unnest function is not recognized.

0
votes

In base R, this can be done as follows.

# For each word, flag the rows that contain it (case-insensitive), then sum per word
r <- colSums(sapply(words, function(s) grepl(s, df[[1]], ignore.case = TRUE)))
as.data.frame(r)
#        r
#Apple   2
#Bananas 1
#tree    1
#tasty   2

Data.

x <-
"'An apple hangs on an apple tree'
'Bananas are yellow and tasty'
'The apple is tasty'"

# scan() treats each quoted sentence as a single element
x <- scan(textConnection(x), what = character())
df <- data.frame(x)

words <- c("Apple", "Bananas", "tree", "tasty")

0 comments

0
votes

A base R solution would be to use grepl with sapply or lapply (df and list_of_words as defined in the dplyr answer above):

sapply(list_of_words, function(x) sum(grepl(x, tolower(df$sentences), fixed = TRUE)))
apple bananas    tree   tasty 
    2       1       1       2
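One caveat with the grepl approaches: they match substrings, so "apple" would also be counted in a row that only contains "pineapple". A minimal sketch of a whole-word variant follows, assuming the words contain no regex metacharacters; the helper `count_rows()` and the extra "Pineapple" row are hypothetical and only there to show the difference.

```r
sentences <- c("An apple hangs on an apple tree",
               "Bananas are yellow and tasty",
               "The apple is tasty",
               "Pineapple juice is tasty")   # hypothetical extra row

words <- c("apple", "bananas", "tree", "tasty")

# Hypothetical helper: number of rows in `text` containing each word
count_rows <- function(words, text, whole_word = TRUE) {
  vapply(words, function(w) {
    # \\b anchors the match at word boundaries
    pattern <- if (whole_word) paste0("\\b", w, "\\b") else w
    sum(grepl(pattern, text, ignore.case = TRUE, perl = TRUE))
  }, integer(1))
}

substring_counts  <- count_rows(words, sentences, whole_word = FALSE)
whole_word_counts <- count_rows(words, sentences)

substring_counts
whole_word_counts
```

With substring matching the "Pineapple" row adds to the "apple" count; with word boundaries it does not.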

0 comments