-
Extraction of numbers from a character string
2013-12-01
SourceSometimes you get data like:
## concentration temperature pH ## 1 2.12mL 11 C 7.0 ## 2 7.5mL -1 C 10.5 ## 3 0.7mL 3 C 8.0 ## 4 7.6mL 5 C 7.5 ## 5 0.11mL 8 C 11.0 ## 6 2.13mL 4 C 4.0 ## 7 0.27mL 5 C 10.0 ## 8 0.45mL 4 C 8.5 ## 9 0.17mL 9 C 7.5 ## 10 0.96mL 5 C 5.5
and you'd like to separate the numerical values from the units in columns such as
concentration
ortemperature
.@BondedDust and @tcash21 kindly provide me some help on stackoverflow to achieve such a task.
Extract numbers from a character string
I'm using the following function to extract the numerical values:
library(stringr) numextract <- function(string){ str_extract(string, "\\-*\\d+\\.*\\d*") }
For example:
numextract("30.5ml")
## [1] "30.5"
numextract(">2g")
## [1] "2"
The function is vectorized:
numextract(c("30.5ml", "37ml"))
## [1] "30.5" "37"
It also extracts negative numbers:
numextract("-3°C")
## [1] "-3"
If you want to ignore the minus sign, delete it in
str_extract(string, "\\-*\\d+\\.*\\d*")
.If there are multiple numbers in the character string, it only extracts the first one:
numextract("between 2ml and 3ml")
## [1] "2"
(yes, I have already received such data in my life).
If you want to extarct multiple numbers use this function instead:
Numextract <- function(string){ unlist(regmatches(string,gregexpr("[[:digit:]]+\\.*[[:digit:]]*",string))) } Numextract("between 2ml and 3ml")
## [1] "2" "3"
Separate numbers and units
Consider the problem of the introduction:
dat
## concentration temperature pH ## 1 2.12mL 11 C 7.0 ## 2 7.5mL -1 C 10.5 ## 3 0.7mL 3 C 8.0 ## 4 7.6mL 5 C 7.5 ## 5 0.11mL 8 C 11.0 ## 6 2.13mL 4 C 4.0 ## 7 0.27mL 5 C 10.0 ## 8 0.45mL 4 C 8.5 ## 9 0.17mL 9 C 7.5 ## 10 0.96mL 5 C 5.5
Extract the numerical values of the
concentration
column( concentration <- numextract(dat$concentration) )
## [1] "2.12" "7.5" "0.7" "7.6" "0.11" "2.13" "0.27" "0.45" "0.17" "0.96"
Sometimes, units are different, and we want to get them :
( unit <- str_split_fixed(dat$concentration, concentration, n=2)[,2] )
## [1] "mL" "mL" "mL" "mL" "mL" "mL" "mL" "mL" "mL" "mL"
And it is a good practice to check we are able to reconstruct the original column:
all(paste0(concentration, unit) == dat$concentration)
## [1] TRUE
Now we'd like to replace the original column by these two columns:
( concentration <- data.frame(concentration, unit) )
## concentration unit ## 1 2.12 mL ## 2 7.5 mL ## 3 0.7 mL ## 4 7.6 mL ## 5 0.11 mL ## 6 2.13 mL ## 7 0.27 mL ## 8 0.45 mL ## 9 0.17 mL ## 10 0.96 mL
To do so, I'm using the following function:
df_replace <- function(dat, col, newcol){ if(!(col %in% names(dat))) stop(paste0("column ", col, " is not in ", deparse(substitute(dat)))) if(is.vector(newcol)){ newcol <- data.frame(newcol, stringsAsFactors=FALSE) newcol <- reshape::rename(newcol, replace=(c("newcol"=col))) } E <- which(names(dat) == col) if(E==1) return(cbind(newcol, dat[,(E++1):length(dat),drop=FALSE])) if(E==length(dat)) return(cbind(dat[,1:(E-1),drop=FALSE], newcol)) else return(cbind(dat[,1:(E-1),drop=FALSE], newcol, dat[,(E++1):length(dat),drop=FALSE])) }
like this:
df_replace(dat, "concentration", concentration)
## concentration unit temperature pH ## 1 2.12 mL 11 C 7.0 ## 2 7.5 mL -1 C 10.5 ## 3 0.7 mL 3 C 8.0 ## 4 7.6 mL 5 C 7.5 ## 5 0.11 mL 8 C 11.0 ## 6 2.13 mL 4 C 4.0 ## 7 0.27 mL 5 C 10.0 ## 8 0.45 mL 4 C 8.5 ## 9 0.17 mL 9 C 7.5 ## 10 0.96 mL 5 C 5.5
Handling character data
Assume you have a long dataset made of columns in character mode only (see next section about the import of Excel files in R). Even the columns containing numerical values are given in character mode, and you need an automatic way to convert these columns in numeric mode.
First, I'm using the following function to check whether a column can be safely converted in numerical mode:
isNumeric <- function(string){ ## test whether a character vector contains numbers only x <- numextract(string) notNA <- which(!is.na(x)) ifelse(length(notNA)==0, FALSE, all(x[notNA]==str_trim(string[notNA]))) }
x <- c("1.5", "3.2") isNumeric(x)
## [1] TRUE
x <- factor(x) isNumeric(x)
## [1] TRUE
x <- c("1.5", "3.2 ") isNumeric(x)
## [1] TRUE
x <- c("1.5", "<3.2") isNumeric(x)
## [1] FALSE
If
isNumeric(x)
returnsTRUE
, the you can safely useas.numeric(x)
to convert
ifx
is a character vector, but you have to useas.numeric(as.character(x))
ifx
is a factor. You can simply useas.numeric(as.character(x))
for both cases.Based on
isNumeric
, the functionNumerize
below takes as input a dataframe containing only character or factor columns, and returns the same dataframe with numerical columns when possible.Numerize <- function(dat, factors=TRUE){ # factorcols <- if(factors) FALSE else vapply(dat, function(col) is.factor(col), TRUE) numcols <- names(dat)[which(vapply(dat, function(col) isNumeric(col), TRUE) & !factorcols)] sapply(numcols, function(col) invisible(dat[[col]] <<- as.numeric(as.character(dat[[col]])))) return(dat) }
For example:
Numerize(data.frame(x=c("1.5", "3.2"), y=c("1.5", "<3.2")))
## x y ## 1 1.5 1.5 ## 2 3.2 <3.2