17 Regular Expressions
17.1 The grep function
- Pattern-search function: finds the strings (text) that satisfy a given condition
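As a minimal sketch of what "finding the strings that satisfy a condition" looks like in practice, the block below calls grep three ways on a small made-up vector. The vector `fruits` and the pattern "apple" are illustrative assumptions, not data from this chapter; `value = TRUE` and `grepl` are standard base R.

```r
# Hypothetical sample data, used only for illustration
fruits <- c("apple", "banana", "cherry", "pineapple")

# Default: grep returns the positions (indices) of the matching elements
idx <- grep(pattern = "apple", x = fruits)

# value = TRUE returns the matching elements themselves
vals <- grep(pattern = "apple", x = fruits, value = TRUE)

# grepl returns one TRUE/FALSE per element
flags <- grepl(pattern = "apple", x = fruits)

print(idx)    # 1 4
print(vals)   # "apple" "pineapple"
print(flags)  # TRUE FALSE FALSE TRUE
```

The index form is convenient for subsetting (`fruits[idx]`), while the logical form from grepl is usually easier to combine with other conditions.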
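17.2 Metacharacters
- Characters that stand for themselves: letters, digits
- Any single character: . (dot)
- Position: ^ $
  ^ means "starts with xx"; $ means "ends with xx"
- Quantity: * ? +
  The asterisk means any number of repetitions (including zero), the plus sign means at least one, and the question mark means zero or one
- Brackets:
  [] Square brackets match any one of the characters listed inside them; a ^ placed first inside the brackets negates the set, so [^xx] matches one character that is not in the set
  {m,n} Braces take one number (an exact count) or two numbers (a range)
  () Parentheses group characters together
- Escape character: \? \*
  Put a backslash (\) in front of a metacharacter to match it literally; inside an R string the backslash itself must be doubled, e.g. "\\?"
- Other:
  \t Tab
  \n Enter (newline)
  soft line break
  \b word boundary
  \s any whitespace character
  \S any non-whitespace character
  \w any word character (letters, Chinese characters)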
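The whitespace and word-boundary escapes in the list above are not exercised by the later examples, so here is a small sketch of them. The test vector `ws` is a made-up assumption; note that each backslash is doubled inside an R string.

```r
# Hypothetical test strings: a space, a tab, nothing, a longer phrase, an underscore
ws <- c("a b", "a\tb", "ab", "cab ride", "a_b")

# \s matches any whitespace character (space or tab here)
has_space <- grep(pattern = "a\\sb", x = ws)

# \S matches any single non-whitespace character
has_nonspace <- grep(pattern = "a\\Sb", x = ws)

# \b matches the empty string at the edge of a word,
# so "\\bab" only matches "ab" at the start of a word
word_start <- grep(pattern = "\\bab", x = ws)

print(has_space)     # 1 2
print(has_nonspace)  # 5
print(word_start)    # 3
```

"cab ride" does not match "\\bab" because the "ab" inside "cab" is preceded by a word character, so there is no word edge in front of it.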
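Example: the any-character dot (.) combined with the quantifiers (*, +, ?)
ss <- c("acb","a b","A b","a#b","a##b","ab")
grep(pattern = "ab",x = ss) # matches "ab" itself
grep(pattern = "a.b",x = ss) # exactly 1 arbitrary character between a and b
grep(pattern = "a..b",x = ss) # exactly 2 arbitrary characters between a and b
grep(pattern = "a.*b",x = ss) # any number of arbitrary characters between a and b, including 0
grep(pattern = "a.+b",x = ss) # at least 1 arbitrary character between a and b
grep(pattern = "a.?b",x = ss) # 0 or 1 arbitrary character between a and b
## [1] 6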
## [1] 1 2 4
## [1] 5
## [1] 1 2 4 5 6
## [1] 1 2 4 5
## [1] 1 2 4 6
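Example: the any-character dot (.) combined with the braces quantifier ({})
ss <- c("acb","a b","A b","a#b","a##b","ab")
grep(pattern = "a.{1,3}b",x = ss) # {m,n} gives a range: 1 to 3 arbitrary characters between a and b
grep(pattern = "a.{3}b",x = ss) # {n} gives an exact count: exactly 3 arbitrary characters between a and b
## [1] 1 2 4 5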
## integer(0)
Example: the position characters (^, $)
ss <- c("1314","abc","a b c","ABC","aB12c","13ab14c")
grep(pattern = "^1",x = ss) # indices of the elements that start with 1
grep(pattern = "4$",x = ss) # indices of the elements that end with 4
## [1] 1 6
## [1] 1
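Example: square brackets ([]); note that a ^ placed first inside the brackets means negation
ss <- c("a2b","a1cb","ab","a111b","a1b","acb")
grep(pattern = "a[2c]b" ,x = ss) # exactly 1 character from the set (2 or c) between a and b
grep(pattern = "a[2c]*b" ,x = ss) # any number (including 0) of characters from the set (2 or c) between a and b
grep(pattern = "a[0-9]b",x = ss) # exactly 1 character in the range 0 to 9 between a and b
grep(pattern = "a[a-z]b",x = ss) # exactly 1 character in the range a to z between a and b
grep(pattern = "a[^c]b",x = ss) # ^ inside the brackets means "not": exactly 1 character other than c between a and b
## [1] 1 6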
## [1] 1 3 6
## [1] 1 5
## [1] 6
## [1] 1 5
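Anchors, bracket ranges, and braces are often combined to check that a whole string has a given shape rather than merely containing it. The sketch below assumes a made-up identifier format of four digits, a hyphen, and three digits; the vector `ids` is hypothetical.

```r
# Hypothetical identifiers; only the first has exactly the form dddd-ddd
ids <- c("2024-001", "x2024-001", "2024-0012", "24-001")

# Anchored with ^ and $: the entire string must be 4 digits, "-", 3 digits
strict <- grepl(pattern = "^[0-9]{4}-[0-9]{3}$", x = ids)

# Unanchored: it is enough that the shape occurs somewhere inside the string
loose <- grepl(pattern = "[0-9]{4}-[0-9]{3}", x = ids)

print(strict)  # TRUE FALSE FALSE FALSE
print(loose)   # TRUE  TRUE  TRUE FALSE
```

Without the anchors, "x2024-001" and "2024-0012" pass because they contain a matching substring, which is rarely what a validation check intends.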
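Example: parentheses, which can be combined with the gsub function for replacement
ss <- c("1314","abc","a b c","ABC","aB12c","13ab14c")
grep(pattern = "(13).4",x = ss) # indices matching the group 13 + any 1 character + 4
grep(pattern = "(13|13ab).4",x = ss) # indices matching the group 13 or 13ab + any 1 character + 4
gsub(pattern = "(13|13ab).4",replacement = "XXXX",x = ss) # replace (13 or 13ab) + any 1 character + 4 with "XXXX"
gsub(pattern = "(13|13ab).4",replacement = "\\1",x = ss) # replace the match with the text captured by the first group; \\1 is a backreference
## [1] 1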
## [1] 1 6
## [1] "XXXX" "abc" "a b c" "ABC" "aB12c" "XXXXc"
## [1] "13" "abc" "a b c" "ABC" "aB12c" "13abc"
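Backreferences become more useful with several groups, since \\1, \\2, \\3 refer to the groups in the order their opening parentheses appear. As a sketch, the block below reorders a date written as year-month-day; the vector `dates` and the target format are assumptions made for illustration.

```r
# Hypothetical input: two ISO-style dates and one string that should be left alone
dates <- c("2024-01-15", "1999-12-31", "no date here")

# Group 1 = year, group 2 = month, group 3 = day;
# the replacement emits them in day/month/year order
reordered <- sub(pattern = "^([0-9]{4})-([0-9]{2})-([0-9]{2})$",
                 replacement = "\\3/\\2/\\1",
                 x = dates)

print(reordered)  # "15/01/2024" "31/12/1999" "no date here"
```

Elements that do not match the pattern are returned unchanged, which is why the third string passes through as is.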
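Example: the escape character
ss <- c("acb","a?b","a??b")
grep(pattern = "a\\?b",x = ss) # \\? stands for a literal question mark
grep(pattern = "a\\?+b",x = ss) # \\?+ stands for at least one question mark
grep(pattern = "^\\w+$",x = ss) # word characters from start to end, i.e. the whole string consists of word characters (letters or Chinese characters)
grep(pattern = "\\W+",x = ss) # contains a non-word character (symbol or space)
grep(pattern = "\\d+",x = ss) # contains a digit
grep(pattern = "\\D+",x = ss) # contains a non-digit
## [1] 2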
## [1] 2 3
## [1] 1
## [1] 2 3
## integer(0)
## [1] 1 2 3
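When the pattern is meant to be plain text, an alternative to escaping each metacharacter is the `fixed = TRUE` argument of grep, sub, and gsub, which turns off regular-expression interpretation entirely. A minimal sketch, reusing the vector from the example above:

```r
ss <- c("acb", "a?b", "a??b")

# Escaped regex: a literal question mark between a and b
escaped <- grep(pattern = "a\\?b", x = ss)

# fixed = TRUE: the pattern is searched for as a literal substring
literal <- grep(pattern = "a?b", x = ss, fixed = TRUE)

# Unescaped regex: "a?" means an optional a, so anything containing b matches
unescaped <- grep(pattern = "a?b", x = ss)

# The same idea with replacement: a literal dot versus the any-character dot
dots_literal <- gsub(pattern = ".", replacement = "-", x = "1.2.3", fixed = TRUE)
dots_regex   <- gsub(pattern = ".", replacement = "-", x = "1.2.3")

print(escaped)       # 2
print(literal)       # 2
print(unescaped)     # 1 2 3
print(dots_literal)  # "1-2-3"
print(dots_regex)    # "-----"
```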
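17.3 Common text-processing functions
- length/nchar (number of elements in a vector / number of characters in each element)
- Pattern search: grep/grepl (return indices / return logical values)
- Pattern replacement: sub/gsub
- Extract/trim: substr/substring/strtrim
- Join/split: paste/strsplit
- Convert/translate: tolower/toupper/chartr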
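The two replacement functions in the list differ only in how many matches they replace: sub stops after the first match in each element, while gsub replaces every match. A short sketch on a made-up string:

```r
# Hypothetical input with two hyphens
s <- "a-b-c"

first_only <- sub(pattern = "-", replacement = "+", x = s)   # replaces the first match only
all_of_them <- gsub(pattern = "-", replacement = "+", x = s) # replaces every match

print(first_only)   # "a+b-c"
print(all_of_them)  # "a+b+c"
```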
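ss1 <- c(1,12,123,"abcdef")
length(ss1) # number of elements in the vector
nchar(ss1) # number of characters in each element
## [1] 4
## [1] 1 2 3 6
substr("abcdef", 2, 4) # extract characters 2 through 4
substring("abcdef", first = 1:6, last = 1:6) # extract each character separately
strtrim(ss1,width = 2) # trim each element to its first two characters
## [1] "bcd"
## [1] "a" "b" "c" "d" "e" "f"
## [1] "1" "12" "12" "ab"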
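paste(ss1, "###") # join element-wise with the default separator, a space
paste(ss1, "###", sep = "-") # join element-wise with "-" as the separator
paste(ss1, "###", sep = "-", collapse = ";") # then collapse the results into a single string with ";"
## [1] "1 ###" "12 ###" "123 ###" "abcdef ###"
## [1] "1-###" "12-###" "123-###" "abcdef-###"
## [1] "1-###;12-###;123-###;abcdef-###"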
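strsplit("abc def",split = "") # split into individual characters
strsplit("abc def",split = " ") # split on a single space
strsplit("abc def",split = " +") # split accepts a regular expression; " +" splits on one or more spaces
ss1 <- c(1,12,123,"abcdef")
strsplit(ss1,split = "") # split every element into individual characters
## [[1]]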
## [1] "a" "b" "c" " " "d" "e" "f"
##
## [[1]]
## [1] "abc" "def"
##
## [[1]]
## [1] "abc" "def"
##
## [[1]]
## [1] "1"
##
## [[2]]
## [1] "1" "2"
##
## [[3]]
## [1] "1" "2" "3"
##
## [[4]]
## [1] "a" "b" "c" "d" "e" "f"
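As the output above shows, strsplit always returns a list with one component per input element, even for a single string. A common follow-up, sketched here on a made-up sentence, is to take the first component with `[[1]]` and later rebuild a string with paste and `collapse`.

```r
# Hypothetical sentence with uneven spacing between words
sentence <- "regular  expressions are   useful"

# " +" splits on runs of one or more spaces; [[1]] extracts the character vector
words <- strsplit(sentence, split = " +")[[1]]

# Re-join the words with exactly one space between them
normalized <- paste(words, collapse = " ")

print(words)          # "regular" "expressions" "are" "useful"
print(length(words))  # 4
print(normalized)     # "regular expressions are useful"
```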
17.4 Case studies
17.4.1 Finding the jpg image files under a given directory (the list.files function)
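directory="../front_end/fig/"
file_name=list.files(path = directory, # directory to search
pattern = "\\.jpg$", # regular expression: file names ending in .jpg
recursive = TRUE, # whether to recurse into subdirectories
full.names = TRUE, # whether to return full paths
ignore.case = TRUE) # whether to ignore case when matching
file_name
file.size(file_name) # file sizes, in bytes
## [1] "../front_end/fig//appendix/100-01.jpg"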
## [2] "../front_end/fig//appendix/100-02.jpg"
## [3] "../front_end/fig//appendix/100-03.jpg"
## [4] "../front_end/fig//appendix/ps-copy.jpg"
## [5] "../front_end/fig//appendix/切图导出.jpg"
## [6] "../front_end/fig//appendix/图层.jpg"
## [7] "../front_end/fig//case/80-01.jpg"
## [8] "../front_end/fig//case/80-02.jpg"
## [9] "../front_end/fig//case/80-03.jpg"
## [10] "../front_end/fig//case/80-04.jpg"
## [11] "../front_end/fig//case/80-05.jpg"
## [12] "../front_end/fig//case/80-06.jpg"
## [13] "../front_end/fig//case/80-07.jpg"
## [14] "../front_end/fig//CSS/1-01.jpg"
## [15] "../front_end/fig//CSS/1-02.jpg"
## [16] "../front_end/fig//CSS/1-03.jpg"
## [17] "../front_end/fig//CSS/101-01.jpg"
## [18] "../front_end/fig//CSS/19-01.jpg"
## [19] "../front_end/fig//CSS/19-02.jpg"
## [20] "../front_end/fig//CSS/19-03.jpg"
## [21] "../front_end/fig//CSS/19-04.jpg"
## [22] "../front_end/fig//CSS/19-05.jpg"
## [23] "../front_end/fig//CSS/19-06.jpg"
## [24] "../front_end/fig//CSS/19-20.jpg"
## [25] "../front_end/fig//CSS/19-21.jpg"
## [26] "../front_end/fig//CSS/19-22.jpg"
## [27] "../front_end/fig//CSS/19-23.jpg"
## [28] "../front_end/fig//CSS/19-30.jpg"
## [29] "../front_end/fig//CSS/20-01.jpg"
## [30] "../front_end/fig//CSS/20-02.jpg"
## [31] "../front_end/fig//CSS/20-10.jpg"
## [32] "../front_end/fig//CSS/20-11.jpg"
## [33] "../front_end/fig//CSS/20-12.jpg"
## [34] "../front_end/fig//CSS/20-15.jpg"
## [35] "../front_end/fig//CSS/20-16.jpg"
## [36] "../front_end/fig//CSS/20-17.jpg"
## [37] "../front_end/fig//CSS/20-18.jpg"
## [38] "../front_end/fig//CSS/20-20.jpg"
## [39] "../front_end/fig//CSS/20-21.jpg"
## [40] "../front_end/fig//CSS/20-22.jpg"
## [41] "../front_end/fig//CSS/20-25.jpg"
## [42] "../front_end/fig//CSS/20-26.jpg"
## [43] "../front_end/fig//CSS/21-01.jpg"
## [44] "../front_end/fig//CSS/21-02.jpg"
## [45] "../front_end/fig//CSS/21-03.jpg"
## [46] "../front_end/fig//CSS/21-04.jpg"
## [47] "../front_end/fig//CSS/21-05.jpg"
## [48] "../front_end/fig//CSS/21-06.jpg"
## [49] "../front_end/fig//CSS/21-07.jpg"
## [50] "../front_end/fig//CSS/21-08.jpg"
## [51] "../front_end/fig//CSS/21-09.jpg"
## [52] "../front_end/fig//CSS/21-10.jpg"
## [53] "../front_end/fig//CSS/21-14.jpg"
## [54] "../front_end/fig//CSS/21-15.jpg"
## [55] "../front_end/fig//CSS/21-16.jpg"
## [56] "../front_end/fig//CSS/21-17.jpg"
## [57] "../front_end/fig//CSS/21-18.jpg"
## [58] "../front_end/fig//CSS/5-01.jpg"
## [59] "../front_end/fig//CSS/6-01.jpg"
## [60] "../front_end/fig//CSS/CSS三角.jpg"
## [61] "../front_end/fig//CSS/h5新增语义标签.jpg"
## [62] "../front_end/fig//CSS/翻页.jpg"
## [63] "../front_end/fig//CSS/内马尔.jpg"
## [64] "../front_end/fig//CSS/品优购整体图.jpg"
## [65] "../front_end/fig//CSS/三角加强.jpg"
## [66] "../front_end/fig//CSS/视口.jpg"
## [67] "../front_end/fig//CSS/淘宝焦点案例.jpg"
## [68] "../front_end/fig//JS/01/01.jpg"
## [69] "../front_end/fig//JS/01/ASCII.jpg"
## [70] "../front_end/fig//JS/01/statement.jpg"
## [71] "../front_end/fig//JS/01/案例-渲染学生信息.jpg"
## [72] "../front_end/fig//JS/01/断点.jpg"
## [73] "../front_end/fig//JS/01/断点2.jpg"
## [74] "../front_end/fig//JS/01/断点3.jpg"
## [75] "../front_end/fig//JS/02/DOM树.jpg"
## [76] "../front_end/fig//JS/02/JS语法分类.jpg"
## [77] "../front_end/fig/1-1.jpg"
## [78] "../front_end/fig/1-10.jpg"
## [79] "../front_end/fig/1-11.jpg"
## [80] "../front_end/fig/1-12.jpg"
## [81] "../front_end/fig/1-2.jpg"
## [82] "../front_end/fig/1-20.jpg"
## [83] "../front_end/fig/1-21.jpg"
## [84] "../front_end/fig/1-22.jpg"
## [85] "../front_end/fig/1-23.jpg"
## [86] "../front_end/fig/1-3.jpg"
## [87] "../front_end/fig/1-30.jpg"
## [88] "../front_end/fig/1-31.jpg"
## [89] "../front_end/fig/1-32.jpg"
## [90] "../front_end/fig/1-33.jpg"
## [91] "../front_end/fig/1-4.jpg"
## [92] "../front_end/fig/1-41.jpg"
## [93] "../front_end/fig/1-5.jpg"
## [94] "../front_end/fig/1-51.jpg"
## [95] "../front_end/fig/1-6.jpg"
## [96] "../front_end/fig/1-7.jpg"
## [97] "../front_end/fig/17-02.jpg"
## [98] "../front_end/fig/17-11.jpg"
## [99] "../front_end/fig/18-01.jpg"
## [1] 13029 11057 49485 153758 361807 338243 58444 19010 7625 28207
## [11] 13162 87024 23939 13528 9028 3909 147563 98126 61088 35949
## [21] 24838 22270 14697 10954 13845 37489 20298 24826 10215 11651
## [31] 22618 17758 14780 25265 22767 28288 36725 18726 15727 69040
## [41] 11881 59069 17018 43281 11027 14447 40416 30519 53498 21247
## [51] 36862 19355 11262 14066 7132 10024 5582 6286 17656 2997
## [61] 31630 16079 41010 128708 5517 68877 62464 30827 363379 67013
## [71] 38112 89882 176514 161340 61809 30913 92362 14271 6913 7374
## [81] 58411 45855 7360 8515 13057 56081 14049 13009 22879 26817
## [91] 50769 71105 99416 51007 144687 58105 936815 10285 23057
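The listing above depends on a local directory, so here is a self-contained sketch of the same call that can be run anywhere. It creates a few throwaway files under `tempdir()`; the folder name `regex_demo` and the file names are hypothetical, chosen to show the effect of `ignore.case`, `recursive`, and the `$` anchor.

```r
# Build a small temporary directory tree (all names are made up for this sketch)
demo_dir <- file.path(tempdir(), "regex_demo")
dir.create(file.path(demo_dir, "sub"), recursive = TRUE, showWarnings = FALSE)
writeLines("x",  file.path(demo_dir, "a.jpg"))
writeLines("xx", file.path(demo_dir, "sub", "b.JPG"))   # upper-case extension, in a subdirectory
writeLines("x",  file.path(demo_dir, "notes.jpg.txt"))  # contains ".jpg" but does not end with it

jpgs <- list.files(path = demo_dir,
                   pattern = "\\.jpg$",   # escaped dot, anchored at the end of the name
                   recursive = TRUE,
                   full.names = TRUE,
                   ignore.case = TRUE)

found <- basename(jpgs)             # file names without the directory part
size_kb <- file.size(jpgs) / 1024   # file.size returns bytes; divide by 1024 for KB

print(found)
print(size_kb)

# Remove the temporary files again
unlink(demo_dir, recursive = TRUE)
```

Only a.jpg and b.JPG should be found: the `$` anchor excludes notes.jpg.txt, and `ignore.case = TRUE` is what lets the upper-case extension through.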
## [1] "abc" "abcxx"
## [1] "ABC" "ABCXX"
## [1] "UAGCGGG"
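The last three output lines have the shape of results from the conversion functions listed in 17.3 (tolower, toupper, chartr). The inputs that produced them are not shown, so the sketch below uses assumed inputs that happen to yield the same values; treat the vector `x` and the sequence "TAGCGGG" as illustrative guesses only.

```r
# Assumed inputs (not taken from the original text)
x <- c("ABC", "abcXX")

lower <- tolower(x)   # convert every letter to lower case
upper <- toupper(x)   # convert every letter to upper case

# chartr translates characters one for one: each character in `old`
# is replaced by the character at the same position in `new`
rna <- chartr(old = "T", new = "U", x = "TAGCGGG")

print(lower)  # "abc" "abcxx"
print(upper)  # "ABC" "ABCXX"
print(rna)    # "UAGCGGG"
```

Unlike gsub, chartr does not interpret its arguments as regular expressions, and `old` and `new` must have the same number of characters.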