How to search for the elements of one list in another in R, and other interesting stuff

The other day, I found an interesting question on social media about text search. It was not a simple search within a vector or a pattern search within a string. It had more to do with data manipulation first and text search second. I think it is a good sample to demonstrate the following:

  1. Data
  2. How to split the elements of a character vector in R?
  3. How to convert a list to a vector?
  4. How to search a list element into a vector?
  5. How to add a new variable (column) into an existing data frame?

Data

With the outline of the article set, let's prepare the data. In R, a dataframe can be created using the data.frame function. As per rdocumentation.org, a dataframe is "tightly coupled collections of variables which share many of the properties of matrices and of lists, used as the fundamental data structure by most of R's modelling software".

Let's create a data frame with character variables.

df <- data.frame(item = c("blood", "bone", "central", "skin", "soft"),
                           testset = c("0 0 6 65 73 41", "42", "90 53", "1", "65 68"),
                           trueset = c("43 35 6 65 73 41", "42", "53 7 60", "73", "60 68"))
df
##      item        testset          trueset
## 1   blood 0 0 6 65 73 41 43 35 6 65 73 41
## 2    bone             42               42
## 3 central          90 53          53 7 60
## 4    skin              1               73
## 5    soft          65 68            60 68
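One assumption worth stating: the rest of the article relies on "testset" and "trueset" being character columns. Since R 4.0.0 that is the default behaviour of data.frame, but older versions converted strings to factors unless stringsAsFactors = FALSE was passed. Here is a minimal sketch that makes the choice explicit (df_chr is an illustrative name, not part of the original code):

```r
# Build a small data frame and force character columns regardless of R version.
df_chr <- data.frame(item = c("blood", "bone"),
                     testset = c("0 0 6 65 73 41", "42"),
                     stringsAsFactors = FALSE)

# Every column should report "character"; strsplit() needs character input.
sapply(df_chr, class)
is.character(df_chr$testset)
```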

Now that we know what the data look like, let's define the problem.

The problem

The problem is to count how many individual elements of one vector occur in another. These vectors are stored in the dataframe as space-separated strings. The results must be saved into individual variables within the same data frame. The conditions which define the search are as follows.

a.result - count the elements from the "testset" variable that occur within the "trueset" variable of the same row.

b.result - count the elements from the "testset" variable that do not occur within the "trueset" variable of the same row.

It is an interesting exercise and I would recommend that you attempt it first before you go through the solution. Please put your version of the solution in the comments section. Thank you.

Expected result

Here is the expected output:
##      item        testset          trueset a.result b.result
## 1   blood 0 0 6 65 73 41 43 35 6 65 73 41        4        2
## 2    bone             42               42        1        0
## 3 central          90 53          53 7 60        1        1
## 4    skin              1               73        0        1
## 5    soft          65 68            60 68        1        1

The Solution

Before we start looking at the solution, let's examine the available data with the str function. Here is the output of str(df).
## 'data.frame':    5 obs. of  3 variables:
##  $ item            : chr  "blood" "bone" "central" "skin" ...
##  $ testset         : chr  "0 0 6 65 73 41" "42" "90 53" "1" ...
##  $ trueset         : chr  "43 35 6 65 73 41" "42" "53 7 60" "73" ...
Dataframe df contains three columns. All three of them are character type vectors. Comparing "testset" and "trueset" elements in their existing form will only result in observation #2 as a match. Look at the code below which explains this.

print(df$testset)
## [1] "0 0 6 65 73 41" "42"             "90 53"          "1"             
## [5] "65 68"
print(df$trueset)
## [1] "43 35 6 65 73 41" "42"               "53 7 60"          "73"              
## [5] "60 68"
df$testset %in% df$trueset
## [1] FALSE  TRUE FALSE FALSE FALSE
Before we start searching the elements from "testset" into "trueset", we will have to split both the character vectors by space. Here is the code for it.

How to split the elements of a character vector in R?

The strsplit function splits the elements of a character vector in R. It returns a list of the same length as the input, and each element of this list holds the substrings obtained by splitting the corresponding element of the original vector. Let's split the two vectors and save them into two new variables within the same dataframe. Let's check out the code and output.

df$test <- strsplit(df$testset, " ")
class(df$test)
## [1] "list"
df$test
## [[1]]
## [1] "0"  "0"  "6"  "65" "73" "41"
## 
## [[2]]
## [1] "42"
## 
## [[3]]
## [1] "90" "53"
## 
## [[4]]
## [1] "1"
## 
## [[5]]
## [1] "65" "68"
df$true <- strsplit(df$trueset, " ")
class(df$true)
## [1] "list"
df$true
## [[1]]
## [1] "43" "35" "6"  "65" "73" "41"
## 
## [[2]]
## [1] "42"
## 
## [[3]]
## [1] "53" "7"  "60"
## 
## [[4]]
## [1] "73"
## 
## [[5]]
## [1] "60" "68"
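The split above assumes exactly one space between values. As a hedged sketch, if the input could contain repeated spaces or leading and trailing whitespace (an assumption, since the sample data does not), trimming first and splitting on a regular expression avoids empty strings in the result. The vector messy is made up for illustration.

```r
# Hypothetical input with uneven spacing.
messy <- c("0  0 6", " 42 ", "90 53")

# Splitting on a single space leaves empty strings behind.
strsplit(messy, " ")

# trimws() removes outer whitespace; " +" matches one or more spaces.
clean <- strsplit(trimws(messy), " +")
clean
lengths(clean)
```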
Now that the two character vectors are split and stored as lists, we can easily access their individual elements and check whether they are present in other lists or vectors. To satisfy the first requirement, let's find out how many elements from "test" are present in "true" in the same observation (same row).

Before we search the list elements into a vector, let's first look at "how to convert a list into a vector". In the section above we have converted the character vector into a list and now we are converting the list back to the character vector. Believe me, this is required for the exercise we are doing here.

How to convert a list to a vector?

A list can be converted to a vector with the unlist function. It takes a list (or a vector) as input and returns a single vector containing all the atomic elements of that list. Let's apply it to the two new variables we added to the original dataframe. Note that calling unlist on a whole column flattens every row into one long vector; in the search further down it is applied to one row at a time.

unlist(df$test)
##  [1] "0"  "0"  "6"  "65" "73" "41" "42" "90" "53" "1"  "65" "68"
unlist(df$true)
##  [1] "43" "35" "6"  "65" "73" "41" "42" "53" "7"  "60" "73" "60" "68"
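To see why unlist matters when working with a single row, here is a small sketch of the difference between single and double brackets on a list. The object lst is an illustrative stand-in for a column like df$test.

```r
# A small list of character vectors, shaped like df$test.
lst <- list(c("0", "6"), "42", c("90", "53"))

one_elem_list <- lst[1]     # single brackets: still a list, of length 1
inner_vector  <- lst[[1]]   # double brackets: the character vector itself

class(one_elem_list)
class(inner_vector)

# unlist() on the one-element list gives the same vector as [[ ]].
identical(unlist(lst[1]), lst[[1]])

# unlist() on the whole list flattens every element into one vector.
unlist(lst)
```

In other words, unlist(x[i]) and x[[i]] are interchangeable for a list of plain character vectors like this one.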
Now that we have all the data in the form we require, we can start searching for the elements of the variable "test" in the variable "true". Before we do, we must remember that the two new columns are of type list.

How to search a list element into a vector?

Let's look at the code to search an individual list element within another vector and understand it.
sapply(seq_along(df$true), function(idx) sum(df$true[[idx]] %in% unlist(df$test[idx])))
The seq_along function generates the integer sequence 1, 2, ..., n, where n is the length of its argument, so it gives one index per list element. For the variable "true", seq_along returns the following.
seq_along(df$true)
## [1] 1 2 3 4 5
Each index refers to one element of the list. The corresponding rows, shown here in their original unsplit form, are as below.
## [1] "43 35 6 65 73 41" "42"               "53 7 60"          "73"              
## [5] "60 68"
The next function used above is sapply. It is a wrapper around lapply that applies a function to each element of its first argument and, where possible, simplifies the result to a vector or matrix instead of returning a list. Here it returns a vector of counts, one per row. Each element of the returned object is the result of the function defined within the sapply call. The code function(idx) defines an anonymous function in which the actual element search takes place. Within this function, the element at position idx is extracted from the list (note the [[ ]] notation) and compared with another vector.

To explain it further, the first iteration takes 1 from the output of seq_along and passes it to the custom function as idx. Applying [[ ]] to the variable "true", which is a list, makes the first element available as a character vector: "43" "35" "6" "65" "73" "41". The %in% operator then checks each of those values against the vector produced by unlist(df$test[idx]), which is "0" "0" "6" "65" "73" "41" (the same vector that df$test[[idx]] would return). For every value, %in% returns TRUE if it is found and FALSE otherwise, and sum counts the TRUE values because TRUE is treated as 1. The two vectors have four values in common (6, 65, 73 and 41), so the result for the first row is 4. The same process is repeated for every observation.
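
The first iteration can be reproduced by hand. The sketch below uses two standalone vectors, true1 and test1 (illustrative names), holding the split values of the first row.

```r
# First row of the data, already split into individual values.
true1 <- c("43", "35", "6", "65", "73", "41")
test1 <- c("0", "0", "6", "65", "73", "41")

# %in% returns one logical value per element on its left-hand side.
hits <- true1 %in% test1
hits
# FALSE FALSE TRUE TRUE TRUE TRUE

# sum() treats TRUE as 1 and FALSE as 0, so this counts the matches.
sum(hits)
# 4
```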

Now that we have the answer to the first question, let's store it in the original dataframe.

How to add a new variable (column) into an existing data frame?

Adding a new variable (column) to an existing dataframe is simple. The following code shows how.
df$a.result <-
        sapply(seq_along(df$true), function(idx) sum(df$true[[idx]] %in% unlist(df$test[idx])))
Writing the name of the new column after the dataframe with the $ operator creates the new variable, as long as the value assigned to it has a compatible length (here, one value per row).

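For the second part of the question, counting the unmatched elements, all we have to do is the opposite, which can be achieved by putting a negation sign "!" in front of the %in% expression within the sum function. Note that this version checks the elements of "test" against "true", as the problem statement asks, whereas the a.result code checks "true" against "test". With this data both directions give the same match counts, but they can differ when a row contains duplicate values.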

df$b.result <-
        sapply(seq_along(df$true), function(idx) sum(!(df$test[[idx]] %in% unlist(df$true[idx]))))
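As an alternative sketch (not the approach used in this article), base R's mapply can walk two lists in parallel, which avoids indexing with idx. The names test_list, true_list, matched and unmatched are illustrative, and both counts here check "test" values against "true" values.

```r
test_list <- strsplit(c("0 0 6 65 73 41", "42", "90 53", "1", "65 68"), " ")
true_list <- strsplit(c("43 35 6 65 73 41", "42", "53 7 60", "73", "60 68"), " ")

# mapply() passes the i-th element of each list to the function together.
matched   <- mapply(function(te, tr) sum(te %in% tr), test_list, true_list)
unmatched <- mapply(function(te, tr) sum(!(te %in% tr)), test_list, true_list)

matched
# 4 1 1 0 1
unmatched
# 2 0 1 1 1
```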
Here is the complete code and result.

###############################
##www.dataenq.com
###################################
library(tidyverse)

#Make new dataframe
df <- data.frame(item = c("blood", "bone", "central", "skin", "soft"),
                           testset = c("0 0 6 65 73 41", "42", "90 53", "1", "65 68"),
                           trueset = c("43 35 6 65 73 41", "42", "53 7 60", "73", "60 68"))


df$test <- strsplit(df$testset, " ")
df$true <- strsplit(df$trueset, " ")

# use sapply to loop over the row indices, compare the two lists row by row
# and add the results as new variables
df$a.result <-
        sapply(seq_along(df$true), function(idx) sum(df$true[[idx]] %in% unlist(df$test[idx])))
df$b.result <-
        sapply(seq_along(df$true), function(idx) sum(!(df$test[[idx]] %in% unlist(df$true[idx]))))

# print the final data frame
df %>% select(item, testset, trueset, a.result, b.result)
##      item        testset          trueset a.result b.result
## 1   blood 0 0 6 65 73 41 43 35 6 65 73 41        4        2
## 2    bone             42               42        1        0
## 3 central          90 53          53 7 60        1        1
## 4    skin              1               73        0        1
## 5    soft          65 68            60 68        1        1
I hope you liked this article. If so, please comment and share it. Thank you.