17 Regular Expressions

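17.1 The grep function

  • Pattern-matching function: finds the strings/text that satisfy a given condition
ss <- c("1314","abc","a b c","ABC","aB12c","13ab14c")
grep(pattern = "ab",x = ss) #indices of the elements of ss that contain "ab"
grep(pattern = "ab",x = ss,ignore.case = TRUE) #ignore case
grep(pattern = "ab",x = ss,value = TRUE) #return the matching elements instead of their indices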

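17.1.1 Function arguments

  • pattern: the regular expression to search for
  • x: a character vector
  • ignore.case: whether to ignore case
  • value: whether to return the matching values (rather than their indices)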
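
The arguments above can be compared side by side in a small sketch. It reuses the ss vector from section 17.1 and also calls grepl (listed in section 17.3), which returns one logical value per element; the variable names idx, vals and hits are chosen here for illustration only.

```r
ss <- c("1314", "abc", "a b c", "ABC", "aB12c", "13ab14c")

idx  <- grep(pattern = "ab", x = ss)                # integer positions of the matching elements
vals <- grep(pattern = "ab", x = ss, value = TRUE)  # the matching elements themselves
hits <- grepl(pattern = "ab", x = ss)               # logical vector, one entry per element of ss

idx
vals
hits
ss[hits]  # logical indexing selects the same elements as value = TRUE
```

Indexing ss with either idx or hits gives the same elements that value = TRUE returns directly.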

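17.1.2 Pattern and match

  • Pattern: an "expression" that describes a condition, such as "students whose surname is Zhang" or "lines that begin with a digit"
  • Match: the process of using a pattern to compare, search or filter text is called "matching"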
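
As a rough illustration, the two patterns named above can be written as regular expressions. The vectors students and text_lines are hypothetical data invented for this sketch; the ^ and [] metacharacters it relies on are introduced in section 17.2.

```r
# Hypothetical data, invented for illustration only
students   <- c("Zhang Wei", "Li Na", "Zhang Min", "Wang Zhang")
text_lines <- c("1. introduction", "methods", "2. results")

# "students whose surname is Zhang": the string starts with "Zhang"
zhang <- grep(pattern = "^Zhang", x = students, value = TRUE)

# "lines that begin with a digit": the first character is in 0-9
digit <- grep(pattern = "^[0-9]", x = text_lines, value = TRUE)

zhang
digit
```

"Wang Zhang" is not returned, because the ^ anchor only accepts "Zhang" at the very start of the string.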

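17.2 Metacharacters

  • Characters that stand for themselves: letters, digits
  • Any single character: . (dot)
  • Position: ^ and $. ^ anchors the match at the start of the string, $ anchors it at the end
  • Quantity: *, + and ?. An asterisk means zero or more of the preceding element, a plus sign means at least one, a question mark means zero or one
  • Brackets:
      [] square brackets match any single character listed inside them; a ^ placed first inside the brackets negates the set, so [^c] matches any one character other than c
      {m,n} curly braces take one number (an exact count) or two numbers (a range of counts)
      () parentheses group part of a pattern
  • Escape character: \? \* put a backslash (\) in front of a metacharacter to match it literally. Inside an R string the backslash itself must be doubled, e.g. "\\?"
  • Others: \t Tab, \n newline (Enter), \r carriage return (soft line break), \b word boundary, \s any whitespace character, \S any non-whitespace character, \w any word character (letters, including Chinese characters, plus digits and the underscore)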
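
The "Others" group has no example elsewhere in this chapter, so here is a minimal sketch of a few of them. The vector ss is made up for this illustration, and the variable names are placeholders; the backslashes are doubled because the patterns are written as R strings.

```r
# Made-up test strings; the first element contains a real tab character
ss <- c("a\tb", "a b", "ab", "cat", "concat")

with_tab   <- grep(pattern = "\\t", x = ss)          # contains a tab
with_space <- grep(pattern = "\\s", x = ss)          # contains any whitespace (tab or space)
no_space   <- grep(pattern = "^\\S+$", x = ss)       # consists only of non-whitespace characters
whole_word <- grep(pattern = "\\bcat\\b", x = ss)    # "cat" as a whole word, not inside "concat"

with_tab
with_space
no_space
whole_word
```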

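Example: the any-character symbol (. dot) combined with the quantifiers (*, +, ?)

ss <- c("acb","a b","A b","a#b","a##b","ab")
grep(pattern = "ab",x = ss) #matches "ab"
grep(pattern = "a.b",x = ss) #exactly 1 arbitrary character between a and b
grep(pattern = "a..b",x = ss) #exactly 2 arbitrary characters between a and b
grep(pattern = "a.*b",x = ss) #any number of arbitrary characters between a and b, including zero
grep(pattern = "a.+b",x = ss) #at least one arbitrary character between a and b
grep(pattern = "a.?b",x = ss) #zero or one arbitrary character between a and b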
## [1] 6
## [1] 1 2 4
## [1] 5
## [1] 1 2 4 5 6
## [1] 1 2 4 5
## [1] 1 2 4 6

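Example: the any-character symbol (. dot) combined with the counting braces ({})

ss <- c("acb","a b","A b","a#b","a##b","ab")
grep(pattern = "a.{1,3}b",x = ss) #{1,3} gives a range: 1 to 3 arbitrary characters between a and b
grep(pattern = "a.{3}b",x = ss) #{3} gives an exact count: exactly 3 arbitrary characters between a and b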
## [1] 1 2 4 5
## integer(0)

Example: the position symbols (^, $)

ss <- c("1314","abc","a b c","ABC","aB12c","13ab14c")
grep(pattern = "^1",x = ss) #indices of the elements that start with 1
grep(pattern = "4$",x = ss) #indices of the elements that end with 4
## [1] 1 6
## [1] 1

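Example: square brackets ([]); note that a leading ^ inside the brackets means negation

ss <- c("a2b","a1cb","ab","a111b","a1b","acb")
grep(pattern = "a[2c]b" ,x = ss) #exactly one character from the set (2 or c) between a and b
grep(pattern = "a[2c]*b" ,x = ss) #any number (including zero) of characters from the set (2 or c) between a and b
grep(pattern = "a[0-9]b",x = ss) #exactly one character in the range 0 to 9 between a and b
grep(pattern = "a[a-z]b",x = ss) #exactly one character in the range a to z between a and b

grep(pattern = "a[^c]b",x = ss) #here ^ means negation ("not"): exactly one character other than c between a and b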
## [1] 1 6
## [1] 1 3 6
## [1] 1 5
## [1] 6
## [1] 1 5

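Example: parentheses, which can be combined with the gsub function for replacement

ss <- c("1314","abc","a b c","ABC","aB12c","13ab14c")

grep(pattern = "(13).4",x = ss) #indices matching the group 13 + any 1 character + 4
grep(pattern = "(13|13ab).4",x = ss) #indices matching the group 13 or 13ab + any 1 character + 4
gsub(pattern = "(13|13ab).4",replacement = "XXXX",x = ss) #replace (13 or 13ab) + any 1 character + 4 with "XXXX"
gsub(pattern = "(13|13ab).4",replacement = "\\1",x = ss) #replace the whole match with the text captured by the first group; \\1 is a backreference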
## [1] 1
## [1] 1 6
## [1] "XXXX"  "abc"   "a b c" "ABC"   "aB12c" "XXXXc"
## [1] "13"    "abc"   "a b c" "ABC"   "aB12c" "13abc"
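
Backreferences are not limited to \\1: each pair of parentheses gets its own number, counted from left to right. The sketch below, using a hypothetical dates vector invented for illustration, reorders three captured groups.

```r
# Hypothetical input, invented for illustration only
dates <- c("2023-01-15", "1999-12-31")

# Group 1 = year, group 2 = month, group 3 = day; the replacement reorders them
reordered <- gsub(pattern = "([0-9]{4})-([0-9]{2})-([0-9]{2})",
                  replacement = "\\3/\\2/\\1",
                  x = dates)
reordered
```

Text outside the parentheses (here the two hyphens) is part of the match but is not captured, so it disappears unless the replacement string writes it back.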

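Example: the escape character

ss <- c("acb","a?b","a??b")

grep(pattern = "a\\?b",x = ss) #\\? matches a literal question mark
grep(pattern = "a\\?+b",x = ss) #\\?+ matches one or more question marks
grep(pattern = "^\\w+$",x = ss) #word characters from start to end, i.e. the whole string consists of word characters (letters, digits, underscore)
grep(pattern = "\\W+",x = ss) #contains non-word characters (symbols, spaces)
grep(pattern = "\\d+",x = ss) #contains digits
grep(pattern = "\\D+",x = ss) #contains non-digit characters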
## [1] 2
## [1] 2 3
## [1] 1
## [1] 2 3
## integer(0)
## [1] 1 2 3
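
When the whole pattern should be taken literally, an alternative to escaping each metacharacter is the fixed = TRUE argument of grep and gsub, which turns off regular-expression interpretation. A minimal sketch, with variable names chosen here for illustration:

```r
ss <- c("acb", "a?b", "a??b")

as_regex   <- grep(pattern = "a?b", x = ss)                # ? is a quantifier: optional a, then b
as_literal <- grep(pattern = "a?b", x = ss, fixed = TRUE)  # the three characters a, ?, b
escaped    <- grep(pattern = "a\\?b", x = ss)              # same result via escaping

# The same idea with gsub: a dot is "any character" unless it is taken literally
dots_regex   <- gsub(pattern = ".", replacement = "-", x = "1.5.2")
dots_literal <- gsub(pattern = ".", replacement = "-", x = "1.5.2", fixed = TRUE)

as_regex
as_literal
escaped
dots_regex
dots_literal
```

fixed = TRUE is convenient when the search text comes from data and may contain metacharacters, but it rules out anchors, quantifiers and character sets.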

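17.3 Common text-processing functions

  • length/nchar (number of elements in a vector / number of characters in each element)
  • Pattern search: grep/grepl (return indices / return logical values)
  • Pattern replacement: sub/gsub
  • Extract/trim: substr/substring/strtrim
  • Combine/split: paste/strsplit
  • Convert/translate: tolower/toupper/chartr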
ss1 <- c(1,12,123,"abcdef")

length(ss1)
nchar(ss1)
## [1] 4
## [1] 1 2 3 6
ss1 <- c(1,12,123,"abcdef")

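substr("abcdef", 2, 4) #extract characters 2 to 4
substring("abcdef", first = 1:6, last = 1:6)  #vectorised over first/last: extracts each character separately
strtrim(ss1,width = 2) #trim each element to its first two characters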
## [1] "bcd"
## [1] "a" "b" "c" "d" "e" "f"
## [1] "1"  "12" "12" "ab"
ss1 <- c(1,12,123,"abcdef")

paste(ss1,"###")
## [1] "1 ###"      "12 ###"     "123 ###"    "abcdef ###"
paste(ss1,"###",sep = "-") #sep: the separator placed between the pasted strings
## [1] "1-###"      "12-###"     "123-###"    "abcdef-###"
paste(ss1,"###",sep = "-",collapse = ";") #collapse: join all resulting elements into a single string
## [1] "1-###;12-###;123-###;abcdef-###"
strsplit("abc def",split = "") #split into individual characters
strsplit("abc def",split = " ") #split on a space
strsplit("abc  def",split = " +") #split accepts a regular expression; " +" splits on one or more spaces

ss1 <- c(1,12,123,"abcdef")
strsplit(ss1,split = "")
## [[1]]
## [1] "a" "b" "c" " " "d" "e" "f"
## 
## [[1]]
## [1] "abc" "def"
## 
## [[1]]
## [1] "abc" "def"
## 
## [[1]]
## [1] "1"
## 
## [[2]]
## [1] "1" "2"
## 
## [[3]]
## [1] "1" "2" "3"
## 
## [[4]]
## [1] "a" "b" "c" "d" "e" "f"
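
The list above names both sub and gsub, but the examples in this chapter only use gsub. The difference, sketched here on a made-up vector x, is that sub replaces only the first match in each element while gsub replaces every match.

```r
# Made-up input, for illustration only
x <- c("a-b-c", "no dash")

first_only <- sub(pattern = "-", replacement = "+", x = x)   # first match per element
all_hits   <- gsub(pattern = "-", replacement = "+", x = x)  # every match per element

first_only
all_hits
```

Elements without a match, such as "no dash", are returned unchanged by both functions.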

17.4 Case studies

17.4.1 Finding the jpg image files in a given directory (the list.files function)

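directory <- "../front_end/fig/"
file_name <- list.files(path = directory, #directory to search
                        pattern = "\\.jpg$", #regular expression: file names ending in .jpg
                        recursive = TRUE, #whether to descend into subdirectories
                        full.names = TRUE, #whether to return full paths
                        ignore.case = TRUE) #whether to ignore case
file_name
file.size(file_name) #file sizes, in bytes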
##  [1] "../front_end/fig//appendix/100-01.jpg"        
##  [2] "../front_end/fig//appendix/100-02.jpg"        
##  [3] "../front_end/fig//appendix/100-03.jpg"        
##  [4] "../front_end/fig//appendix/ps-copy.jpg"       
##  [5] "../front_end/fig//appendix/切图导出.jpg"      
##  [6] "../front_end/fig//appendix/图层.jpg"          
##  [7] "../front_end/fig//case/80-01.jpg"             
##  [8] "../front_end/fig//case/80-02.jpg"             
##  [9] "../front_end/fig//case/80-03.jpg"             
## [10] "../front_end/fig//case/80-04.jpg"             
## [11] "../front_end/fig//case/80-05.jpg"             
## [12] "../front_end/fig//case/80-06.jpg"             
## [13] "../front_end/fig//case/80-07.jpg"             
## [14] "../front_end/fig//CSS/1-01.jpg"               
## [15] "../front_end/fig//CSS/1-02.jpg"               
## [16] "../front_end/fig//CSS/1-03.jpg"               
## [17] "../front_end/fig//CSS/101-01.jpg"             
## [18] "../front_end/fig//CSS/19-01.jpg"              
## [19] "../front_end/fig//CSS/19-02.jpg"              
## [20] "../front_end/fig//CSS/19-03.jpg"              
## [21] "../front_end/fig//CSS/19-04.jpg"              
## [22] "../front_end/fig//CSS/19-05.jpg"              
## [23] "../front_end/fig//CSS/19-06.jpg"              
## [24] "../front_end/fig//CSS/19-20.jpg"              
## [25] "../front_end/fig//CSS/19-21.jpg"              
## [26] "../front_end/fig//CSS/19-22.jpg"              
## [27] "../front_end/fig//CSS/19-23.jpg"              
## [28] "../front_end/fig//CSS/19-30.jpg"              
## [29] "../front_end/fig//CSS/20-01.jpg"              
## [30] "../front_end/fig//CSS/20-02.jpg"              
## [31] "../front_end/fig//CSS/20-10.jpg"              
## [32] "../front_end/fig//CSS/20-11.jpg"              
## [33] "../front_end/fig//CSS/20-12.jpg"              
## [34] "../front_end/fig//CSS/20-15.jpg"              
## [35] "../front_end/fig//CSS/20-16.jpg"              
## [36] "../front_end/fig//CSS/20-17.jpg"              
## [37] "../front_end/fig//CSS/20-18.jpg"              
## [38] "../front_end/fig//CSS/20-20.jpg"              
## [39] "../front_end/fig//CSS/20-21.jpg"              
## [40] "../front_end/fig//CSS/20-22.jpg"              
## [41] "../front_end/fig//CSS/20-25.jpg"              
## [42] "../front_end/fig//CSS/20-26.jpg"              
## [43] "../front_end/fig//CSS/21-01.jpg"              
## [44] "../front_end/fig//CSS/21-02.jpg"              
## [45] "../front_end/fig//CSS/21-03.jpg"              
## [46] "../front_end/fig//CSS/21-04.jpg"              
## [47] "../front_end/fig//CSS/21-05.jpg"              
## [48] "../front_end/fig//CSS/21-06.jpg"              
## [49] "../front_end/fig//CSS/21-07.jpg"              
## [50] "../front_end/fig//CSS/21-08.jpg"              
## [51] "../front_end/fig//CSS/21-09.jpg"              
## [52] "../front_end/fig//CSS/21-10.jpg"              
## [53] "../front_end/fig//CSS/21-14.jpg"              
## [54] "../front_end/fig//CSS/21-15.jpg"              
## [55] "../front_end/fig//CSS/21-16.jpg"              
## [56] "../front_end/fig//CSS/21-17.jpg"              
## [57] "../front_end/fig//CSS/21-18.jpg"              
## [58] "../front_end/fig//CSS/5-01.jpg"               
## [59] "../front_end/fig//CSS/6-01.jpg"               
## [60] "../front_end/fig//CSS/CSS三角.jpg"            
## [61] "../front_end/fig//CSS/h5新增语义标签.jpg"     
## [62] "../front_end/fig//CSS/翻页.jpg"               
## [63] "../front_end/fig//CSS/内马尔.jpg"             
## [64] "../front_end/fig//CSS/品优购整体图.jpg"       
## [65] "../front_end/fig//CSS/三角加强.jpg"           
## [66] "../front_end/fig//CSS/视口.jpg"               
## [67] "../front_end/fig//CSS/淘宝焦点案例.jpg"       
## [68] "../front_end/fig//JS/01/01.jpg"               
## [69] "../front_end/fig//JS/01/ASCII.jpg"            
## [70] "../front_end/fig//JS/01/statement.jpg"        
## [71] "../front_end/fig//JS/01/案例-渲染学生信息.jpg"
## [72] "../front_end/fig//JS/01/断点.jpg"             
## [73] "../front_end/fig//JS/01/断点2.jpg"            
## [74] "../front_end/fig//JS/01/断点3.jpg"            
## [75] "../front_end/fig//JS/02/DOM树.jpg"            
## [76] "../front_end/fig//JS/02/JS语法分类.jpg"       
## [77] "../front_end/fig/1-1.jpg"                     
## [78] "../front_end/fig/1-10.jpg"                    
## [79] "../front_end/fig/1-11.jpg"                    
## [80] "../front_end/fig/1-12.jpg"                    
## [81] "../front_end/fig/1-2.jpg"                     
## [82] "../front_end/fig/1-20.jpg"                    
## [83] "../front_end/fig/1-21.jpg"                    
## [84] "../front_end/fig/1-22.jpg"                    
## [85] "../front_end/fig/1-23.jpg"                    
## [86] "../front_end/fig/1-3.jpg"                     
## [87] "../front_end/fig/1-30.jpg"                    
## [88] "../front_end/fig/1-31.jpg"                    
## [89] "../front_end/fig/1-32.jpg"                    
## [90] "../front_end/fig/1-33.jpg"                    
## [91] "../front_end/fig/1-4.jpg"                     
## [92] "../front_end/fig/1-41.jpg"                    
## [93] "../front_end/fig/1-5.jpg"                     
## [94] "../front_end/fig/1-51.jpg"                    
## [95] "../front_end/fig/1-6.jpg"                     
## [96] "../front_end/fig/1-7.jpg"                     
## [97] "../front_end/fig/17-02.jpg"                   
## [98] "../front_end/fig/17-11.jpg"                   
## [99] "../front_end/fig/18-01.jpg"                   
##  [1]  13029  11057  49485 153758 361807 338243  58444  19010   7625  28207
## [11]  13162  87024  23939  13528   9028   3909 147563  98126  61088  35949
## [21]  24838  22270  14697  10954  13845  37489  20298  24826  10215  11651
## [31]  22618  17758  14780  25265  22767  28288  36725  18726  15727  69040
## [41]  11881  59069  17018  43281  11027  14447  40416  30519  53498  21247
## [51]  36862  19355  11262  14066   7132  10024   5582   6286  17656   2997
## [61]  31630  16079  41010 128708   5517  68877  62464  30827 363379  67013
## [71]  38112  89882 176514 161340  61809  30913  92362  14271   6913   7374
## [81]  58411  45855   7360   8515  13057  56081  14049  13009  22879  26817
## [91]  50769  71105  99416  51007 144687  58105 936815  10285  23057
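
The listing above depends on a local directory, so here is a self-contained sketch of the same call that can be run anywhere. It builds a small throwaway tree under tempdir(); the folder name fig_demo and all file names are hypothetical and exist only for this illustration.

```r
# Build a throwaway directory tree (all names are hypothetical)
root <- file.path(tempdir(), "fig_demo")
dir.create(file.path(root, "sub"), recursive = TRUE, showWarnings = FALSE)
invisible(file.create(file.path(root, c("a.jpg", "b.JPG", "notes.txt", "jpg_list.csv"))))
invisible(file.create(file.path(root, "sub", "c.jpg")))

# Same arguments as in the case study, but with relative names for readability
found <- list.files(path = root,
                    pattern = "\\.jpg$",   # a literal dot, then "jpg", at the end of the name
                    recursive = TRUE,
                    full.names = FALSE,
                    ignore.case = TRUE)
found

# Clean up the temporary files
unlink(root, recursive = TRUE)
```

The escaped dot and the $ anchor keep jpg_list.csv out of the result, and ignore.case = TRUE lets b.JPG through.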
tolower(c("aBc","ABCxx")) #convert everything to lower case
toupper(c("aBc","ABCxx")) #convert everything to upper case
## [1] "abc"   "abcxx"
## [1] "ABC"   "ABCXX"
xseq <- "ATCGCCC"
chartr(old = "ATCG",new = "UAGC",x = xseq) #translate old characters into new ones position by position: A becomes U, T becomes A, ...
## [1] "UAGCGGG"

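17.4.2 Quan Song Ci (the complete Song-dynasty ci poems)

gsub(pattern = "\\s",replacement = "",x = songci) #remove all whitespace characters from the text vector
sentence <- grep(pattern = "杨|柳",x = songci,value = TRUE) #find the sentences containing "杨" (poplar) or "柳" (willow)
tail(sentence)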
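
Since the songci vector is not defined in this chapter, the sketch below shows the same two steps on a tiny stand-in. The vector songci_demo and its three short strings are placeholders chosen for illustration, not the actual corpus.

```r
# Stand-in for the songci vector; three short strings for illustration only
songci_demo <- c("杨柳 岸 晓风残月", "明月 几时有", "柳 絮飞")

# Step 1: strip every whitespace character and keep the result
clean <- gsub(pattern = "\\s", replacement = "", x = songci_demo)

# Step 2: keep the lines that contain either character; | means "or"
hits <- grep(pattern = "杨|柳", x = clean, value = TRUE)

clean
hits
```

Note that gsub returns a new vector rather than modifying its input, so the cleaned text has to be assigned to a name if later steps should use it.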

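17.4.3 SCI (filtering rows)

Select the rows whose journal name contains NATURE and whose 2019 IF is greater than 3.

The grepl function returns TRUE or FALSE for each element of a vector, indicating whether that element matches the regular expression.

data
ss1 <- grepl(pattern = "nature",x = data$tittle,ignore.case = TRUE) #TRUE where the journal name contains "nature"
ss2 <- data$IF > 3 #TRUE where the impact factor exceeds 3
data[ss1 & ss2,] #keep the rows that satisfy both conditions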
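
The same filtering logic can be tried on a small self-contained table. Everything in the sketch below is hypothetical: the data frame journals, its column names title and IF, and all of its values are invented for illustration and are not the SCI data used above.

```r
# Hypothetical journal table; names and impact factors are invented
journals <- data.frame(
  title = c("NATURE DEMO A", "Nature Demo B", "Science Demo", "Natural Demo"),
  IF    = c(5.0, 2.5, 8.0, 3.1),
  stringsAsFactors = FALSE
)

has_nature <- grepl(pattern = "nature", x = journals$title, ignore.case = TRUE)
high_if    <- journals$IF > 3

# & combines the two logical vectors element by element
selected <- journals[has_nature & high_if, ]
selected
```

"Natural Demo" is excluded even though its IF is above 3, because "natural" does not contain the full string "nature"; "Nature Demo B" matches the pattern but fails the IF condition.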