The other day, I found an interesting question on social media about text search. It was not a simple text search within a vector, or a pattern search within a string. It had more to do with data manipulation first and text search second. I think it is a good sample to demonstrate a few base R functions that come up along the way: strsplit, unlist, seq_along and sapply.
A data frame is created with the data.frame function. As per rdocumentation.org, data frames are "tightly coupled collections of variables which share many of the properties of matrices and of lists, used as the fundamental data structure by most of R's modelling software". Let's create a data frame with character variables.
df <- data.frame(item = c("blood", "bone", "central", "skin", "soft"),
                 testset = c("0 0 6 65 73 41", "42", "90 53", "1", "65 68"),
                 trueset = c("43 35 6 65 73 41", "42", "53 7 60", "73", "60 68"))
df
##      item        testset          trueset
## 1   blood 0 0 6 65 73 41 43 35 6 65 73 41
## 2    bone             42               42
## 3 central          90 53          53 7 60
## 4    skin              1               73
## 5    soft          65 68            60 68
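A note on R versions: the character columns above depend on R 4.0.0 or later, where data.frame no longer converts strings to factors by default. A minimal sketch, assuming you might run this on an older R, is to pass stringsAsFactors = FALSE explicitly. The name df_chr is only illustrative and is not used elsewhere in this post.

```r
# Build the same data frame, forcing character columns on any R version.
df_chr <- data.frame(
  item    = c("blood", "bone", "central", "skin", "soft"),
  testset = c("0 0 6 65 73 41", "42", "90 53", "1", "65 68"),
  trueset = c("43 35 6 65 73 41", "42", "53 7 60", "73", "60 68"),
  stringsAsFactors = FALSE
)

# Each column should report "character" rather than "factor".
sapply(df_chr, class)
```

This matters later on, because strsplit expects a character vector and stops with an error if it is handed a factor.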
Now that we know what the data looks like, let's define the problem.
The problem is to count how many individual elements of one vector occur in another. These vectors are stored as space-separated strings in the data frame. The results must be saved into new variables within the same data frame. The conditions which define the search are as follows.
a.result - count the elements of the "testset" variable that occur in the "trueset" variable.
b.result - count the elements of the "testset" variable that do not occur in the "trueset" variable.
It is an interesting exercise and I would recommend that you attempt it first before you go through the solution. Please put your version of the solution in the comments section. Thank you. The expected output looks like this:
##      item        testset          trueset a.result b.result
## 1   blood 0 0 6 65 73 41 43 35 6 65 73 41        4        2
## 2    bone             42               42        1        0
## 3 central          90 53          53 7 60        1        1
## 4    skin              1               73        0        1
## 5    soft          65 68            60 68        1        1
Let's first look at the structure of the data frame. We can use the str command for this. Here is the output of the str command over the dataframe df.
str(df)
## 'data.frame': 5 obs. of 3 variables:
## $ item : chr "blood" "bone" "central" "skin" ...
## $ testset : chr "0 0 6 65 73 41" "42" "90 53" "1" ...
## $ trueset : chr "43 35 6 65 73 41" "42" "53 7 60" "73" ...
df contains three columns. All three of them are character vectors. Comparing the "testset" and "trueset" values in their existing form will only report observation #2 as a match, because each value is compared as one whole string. The code below shows this.
print(df$testset)
## [1] "0 0 6 65 73 41" "42" "90 53" "1"
## [5] "65 68"
print(df$trueset)
## [1] "43 35 6 65 73 41" "42" "53 7 60" "73"
## [5] "60 68"
df$testset %in% df$trueset
## [1] FALSE TRUE FALSE FALSE FALSE
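To see why the whole-string comparison finds only one match, it may help to compare a single pair by hand. The sketch below uses the values from row 3 ("central"); the variable names whole and parts are illustrative only.

```r
# Whole strings: "90 53" is not literally equal to "53 7 60", so no match.
whole <- "90 53" %in% "53 7 60"
whole

# Individual elements: "90" is not found, but "53" is.
parts <- c("90", "53") %in% c("53", "7", "60")
parts
```

The second form is what the problem actually asks for, which is why the strings need to be split first.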
Before we start searching for the elements of "testset" in "trueset", we have to split both character vectors by space.
strsplit is used to split a character vector in R. This function returns a list of the same length as its input, and each element of this list contains the substrings of the corresponding original string. Let's split the two vectors and save them into two new variables within the same dataframe. Here is the code and its output.
df$test <- strsplit(df$testset, " ")
class(df$test)
## [1] "list"
df$test
## [[1]]
## [1] "0" "0" "6" "65" "73" "41"
##
## [[2]]
## [1] "42"
##
## [[3]]
## [1] "90" "53"
##
## [[4]]
## [1] "1"
##
## [[5]]
## [1] "65" "68"
df$true <- strsplit(df$trueset, " ")
class(df$true)
## [1] "list"
df$true
## [[1]]
## [1] "43" "35" "6" "65" "73" "41"
##
## [[2]]
## [1] "42"
##
## [[3]]
## [1] "53" "7" "60"
##
## [[4]]
## [1] "73"
##
## [[5]]
## [1] "60" "68"
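Splitting on a single space works here because the values are cleanly separated. As a hedged aside, if the input had double or leading spaces, splitting on " " would leave empty strings in the result. The sketch below uses a hypothetical vector called messy (not part of the original data) and shows one possible way around that with trimws and a whitespace pattern.

```r
# Hypothetical input with a double space and leading/trailing spaces.
messy <- c("0  0 6", " 42 ")

# Splitting on a single space leaves empty strings behind.
naive <- strsplit(messy, " ")
naive

# Trimming first and splitting on runs of whitespace avoids that.
clean <- strsplit(trimws(messy), "\\s+")
clean
```

Empty strings would otherwise be counted as elements in the later steps, so it is worth checking the raw data before choosing the separator.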
To get a plain vector back out of a list, we can use the unlist function. Typically it takes a list or a vector as input and returns a vector containing all the elements of the supplied list or vector. Let's apply it to the two new variables we have added to the original dataframe.
unlist(df$test)
##  [1] "0"  "0"  "6"  "65" "73" "41" "42" "90" "53" "1"  "65" "68"
unlist(df$true)
##  [1] "43" "35" "6"  "65" "73" "41" "42" "53" "7"  "60" "73" "60" "68"
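In the solution, unlist is applied to a single row at a time, as in unlist(df$test[idx]). A small sketch of what that does, assuming a two-element list named parts that stands in for the split column: single brackets keep the list wrapper, and unlist removes it, which gives the same vector as double brackets.

```r
# A stand-in for the split column: a list of character vectors.
parts <- strsplit(c("90 53", "1"), " ")

one_bracket <- parts[1]     # still a list, of length 1
two_bracket <- parts[[1]]   # the character vector itself

class(one_bracket)
class(two_bracket)

# unlist() on the one-element list yields the same vector as [[ ]].
same <- identical(unlist(one_bracket), two_bracket)
same
```

So unlist(df$test[idx]) and df$test[[idx]] can be used interchangeably for this purpose; the code in this post happens to use both.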
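The counting itself is done row by row with sapply:
sapply(seq_along(df$true), function(idx) sum(df$true[[idx]] %in% unlist(df$test[idx])))
Here sapply calls the function once for every index idx. For each row, %in% checks every element of df$true[[idx]] against the matching element of df$test, and sum counts the TRUE values.
seq_along creates an integer sequence from 1 to the length of its argument, that is, one index per element of the list. In the case of the variable "true", the following will be the result of the seq_along function.
seq_along(df$true)
## [1] 1 2 3 4 5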
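One reason to prefer seq_along over the more familiar 1:length(x) is the empty case. The sketch below is only an illustration with made-up lists named empty and filled.

```r
empty <- list()

# seq_along gives a zero-length sequence, so a loop over it never runs.
idx_safe <- seq_along(empty)
idx_safe

# 1:length(empty) is 1:0, which counts down and yields two invalid indices.
idx_risky <- 1:length(empty)
idx_risky

# For a non-empty list both forms give one index per element.
filled <- strsplit(c("53 7 60", "73"), " ")
seq_along(filled)
```

With the five-row data frame in this post the two forms behave the same, so this only matters if the code is reused on data that might be empty.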
df$a.result <-
sapply(seq_along(df$true), function(idx) sum(df$true[[idx]] %in% unlist(df$test[idx])))
df$b.result <-
sapply(seq_along(df$true), function(idx) sum(!(df$test[[idx]] %in% unlist(df$true[idx]))))
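As an alternative sketch, not the approach used in this post, the same counts can be computed with mapply, which walks over two lists in parallel and so avoids the explicit index. The names test_list, true_list, a_alt and b_alt are illustrative.

```r
# The two split columns, rebuilt here so the example stands on its own.
test_list <- strsplit(c("0 0 6 65 73 41", "42", "90 53", "1", "65 68"), " ")
true_list <- strsplit(c("43 35 6 65 73 41", "42", "53 7 60", "73", "60 68"), " ")

# te and tr are the test and true elements of one row at a time.
a_alt <- mapply(function(te, tr) sum(tr %in% te), test_list, true_list)
b_alt <- mapply(function(te, tr) sum(!(te %in% tr)), test_list, true_list)

a_alt   # 4 1 1 0 1
b_alt   # 2 0 1 1 1
```

Note that a.result, as coded, counts the elements of "true" that appear in "test". For this data that gives the same numbers as counting in the other direction, but the two could differ if a value were repeated on only one side.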
###############################
##www.dataenq.com
###################################
library(tidyverse)

# Make new dataframe
df <- data.frame(item = c("blood", "bone", "central", "skin", "soft"),
                 testset = c("0 0 6 65 73 41", "42", "90 53", "1", "65 68"),
                 trueset = c("43 35 6 65 73 41", "42", "53 7 60", "73", "60 68"))

# Split the space-separated strings into list columns
df$test <- strsplit(df$testset, " ")
df$true <- strsplit(df$trueset, " ")

# Use sapply to loop over the row indices, compare the two lists
# element by element, and add the results as new variables
df$a.result <-
  sapply(seq_along(df$true), function(idx) sum(df$true[[idx]] %in% unlist(df$test[idx])))
df$b.result <-
  sapply(seq_along(df$true), function(idx) sum(!(df$test[[idx]] %in% unlist(df$true[idx]))))

# Print the final data frame
df %>% select(item, testset, trueset, a.result, b.result)
##      item        testset          trueset a.result b.result
## 1   blood 0 0 6 65 73 41 43 35 6 65 73 41        4        2
## 2    bone             42               42        1        0
## 3 central          90 53          53 7 60        1        1
## 4    skin              1               73        0        1
## 5    soft          65 68            60 68        1        1
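The tidyverse is only needed for the final select call. If it is not installed, a base R sketch along the following lines should give the same table; the names df_base and result are illustrative, and stringsAsFactors = FALSE is added as a precaution for older R versions.

```r
# Same data and steps as the full solution, using base R only.
df_base <- data.frame(
  item    = c("blood", "bone", "central", "skin", "soft"),
  testset = c("0 0 6 65 73 41", "42", "90 53", "1", "65 68"),
  trueset = c("43 35 6 65 73 41", "42", "53 7 60", "73", "60 68"),
  stringsAsFactors = FALSE
)
df_base$test <- strsplit(df_base$testset, " ")
df_base$true <- strsplit(df_base$trueset, " ")

df_base$a.result <- sapply(seq_along(df_base$true), function(idx)
  sum(df_base$true[[idx]] %in% df_base$test[[idx]]))
df_base$b.result <- sapply(seq_along(df_base$true), function(idx)
  sum(!(df_base$test[[idx]] %in% df_base$true[[idx]])))

# Select the columns to show by name instead of using dplyr::select.
result <- df_base[, c("item", "testset", "trueset", "a.result", "b.result")]
result
```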