newLISP You Can Do --- Strings.

by address-withheld@my.opera.com.invalid (F0)

at 2012-05-25 12:34:04

original http://my.opera.com/freewinger/blog/show.dml/49009172

#############################################################################
# Name:         newLISP You Can Do --- Streams
# Author:       黄登(winger)
# Project:      http://code.google.com/p/newlisp-you-can-do
# Gtalk:        free.winger@gmail.com
# Gtalk-Group: zen0code@appspot.com
# Blog:         http://my.opera.com/freewinger/blog/
# QQ-Group:     31138659
# The greatest truths are the simplest -- newLISP
#
# Copyright 2012 黄登(winger) All rights reserved.
# Permission is granted to copy, distribute and/or
# modify this document under the terms of the GNU Free Documentation License,
# Version 1.2 or any later version published by the Free Software Foundation;
# with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.
#############################################################################

        Freedom cannot be bought with money, but it can be sold for money.        --- Lu Xun

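    In practice, strings are what people and computers exchange most often,
    so much so that most input data ends up being handled as strings.
    If lists are heaven and earth, strings are surely the rivers running between them.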

I. Strings in newLISP code

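    newLISP's string handling is undeniably powerful. Every handy blade you
could want is laid out for you, and each one is an essential tool for the
stay-at-home hacker out to slay some code.

    Commercial over; back to business. ~_~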

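    There are three ways to write a string in newLISP:

    Enclosed in double quotes ; fewer keystrokes, and escape sequences such as "\n" work
    (set 's "this is a string")

    Enclosed in braces ; escape sequences are not processed at all
    (set 's {this is a string})

    Enclosed in [text] tags ; as with braces, and the string may exceed 2048 bytes
    (set 's [text]this is a string[/text])

    Strings written the first or second way cannot exceed 2048 bytes.
    Many people wonder why the first form is needed when the second exists.
    Try the following at the console: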

> {{}

ERR: string token too long : "\{}"

> "\""
"\""

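    See? The point of braces is that they switch off every escape sequence:
inside them an escape sequence has no effect at all. Compare what happens
when you print a string: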

> (print {\n road to freedom})
\n road to freedom"\\n road to freedom"
> (print "\n road to freedom")

 road to freedom"\n road to freedom"

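    Inside braces the escape sequence does nothing: no line break appears. Of
the three forms, only the first can contain its own delimiter, the double
quote (written as \").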
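
    To put numbers on the difference, here is a small sketch (the variable
names quoted, braced and tagged are made up for this illustration). It stores
the same four typed characters with each delimiter and compares the lengths:

```newlisp
; The same four typed characters, a \ t b, with each delimiter.
(set 'quoted "a\tb")             ; \t is translated into one TAB character
(set 'braced {a\tb})             ; backslash and t are kept as typed
(set 'tagged [text]a\tb[/text])  ; also kept as typed

(println (length quoted))        ; 3
(println (length braced))        ; 4
(println (length tagged))        ; 4
(println (= braced "a\\tb"))     ; true: the quoted form needs a doubled backslash
```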

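    I strongly recommend the second form, braces. Why? Convenience: there is
no need to put an extra backslash in front of every backslash, which is
especially handy when writing regular expressions.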

> (println "\t45")
        45
"\t45"
> (println "\\t45")
\t45
"\\t45"
> (println {\t45})
\t45
"\\t45"

> (regex "\\d" "a9b6c4")
("9" 1 1)

> (regex {\d} "a9b6c4")
("9" 1 1)

    Double-quoted strings support the following escape sequences:

character   description
\"          for a double quote inside a quoted string
\n          for a line-feed character (ASCII 10)
\r          for a return character (ASCII 13)
\t          for a TAB character (ASCII 9)
\nnn        for a three-digit ASCII number (nnn format between 000 and 255)
\xnn        for a two-digit-hex ASCII number (xnn format between x00 and xff)
\\          for the backslash character itself

(set 's "this is a string \n with two lines")
(println s)

this is a string
 with two lines

(println "\110\101\119\076\073\083\080") ; decimal ASCII
newLISP

(println "\x6e\x65\x77\x4c\x49\x53\x50") ; hexadecimal ASCII
newLISP

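    What if you had to go the other way and turn a string into numeric escape
sequences like the ones above? How would you do it?
    Hint: use format and unpack.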
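
    One possible answer to this exercise, offered as a sketch rather than the
only way to do it: unpack splits the string into byte values, and format
renders each byte as an escape sequence. The helper name to-escapes and its
fmt parameter are invented here for illustration.

```newlisp
; to-escapes is a hypothetical helper, not a newLISP built-in.
; str: the string to convert; fmt: a format spec applied to every byte.
(define (to-escapes str fmt)
  (join (map (fn (byte) (format fmt byte))
             ; "b" unpacks one unsigned byte, so repeat it once per byte
             (unpack (dup "b" (length str)) str))))

(println (to-escapes "newLISP" "\\%03d"))   ; \110\101\119\076\073\083\080
(println (to-escapes "newLISP" "\\x%02x"))  ; \x6e\x65\x77\x4c\x49\x53\x50
```

    The two format strings are written with a doubled backslash because they
sit inside double quotes; in braces they would be {\%03d} and {\x%02x}.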

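    The third form, [text] [/text], is normally used for very long strings
(more than 2048 bytes), such as web pages. newLISP also uses this format
automatically when it outputs long strings.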

(set 'novel (read-file {my-latest-novel.txt}))
;->
[text]
It was a dark and "stormy" night...
...
The End.
[/text]

    Use length to get the length of a string:

(length novel)
;-> 575196

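    newLISP handles strings that are millions of characters long efficiently.
    length counts bytes. To count the characters of a Unicode string, use
utf8len, which requires the UTF-8 version of newLISP: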

(utf8len (char 955))
;-> 1
(length (char 955))
;-> 2
> (utf8len "个")
4
> (length "个")
2

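    The last two results presumably come from cmd.exe, which does not hand
newLISP UTF-8 input (in UTF-8 the character "个" takes 3 bytes, so a UTF-8
terminal should report 1 and 3). cmd.exe causes many problems with non-ASCII
characters, and they are almost impossible to work around; consoles outside
Win32 do not have this issue.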

II. Making strings

    There are N ways to make a string. Strings here, strings there, strings
everywhere...
    To build a string one character at a time, use char:

(char 33)
;-> "!"

> (char "a")
97

> (char 0x61)
"a"

> (char 97)
"a"

    char handles one character at a time: it converts a character to its number, or a number to its character.

(join (map char (sequence (char "a") (char "z"))))
;-> "abcdefghijklmnopqrstuvwxyz"

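    char gets the ASCII codes of "a" and "z", sequence generates the run of
numbers between them, and map applies char to each number to produce the
matching character. Finally join merges the whole list into one string.

    join also accepts an extra argument to use as a separator.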

(join (map char (sequence (char "a") (char "z"))) "-")
;-> "a-b-c-d-e-f-g-h-i-j-k-l-m-n-o-p-q-r-s-t-u-v-w-x-y-z"

    Like join, append can concatenate strings. (Most list functions also work on strings.)

(append "con" "cat" "e" "nation")
;-> "concatenation"

    We build lists with list, and we build strings with string.
    string converts each of its arguments to text and combines them into one string.

(define x 42)
(string {the value of } 'x { is } x)
;-> "the value of x is 42"

    For finer control over string output, use format, which we will meet shortly.
    dup makes repeated copies of a string:

> (dup "帅锅" 5)
"帅锅帅锅帅锅帅锅帅锅"

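    date returns a string describing the current date and time. Given a number of seconds since 1970-01-01 UTC, it formats that moment in local time instead.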

> (date)
"Mon May 14 15:50:34 2012"

> (date 1234567890)
"Sat Feb 14 07:31:30 2009"

III. String surgery

    "Surgery" sounds scary, but it simply means changing a string permanently.

    Many functions operate on strings, and some of them are destructive (in
the manual these functions are marked with a !).

(set 't "a hypothetical one-dimensional subatomic particle")
(reverse t)
;-> "elcitrap cimotabus lanoisnemid-eno lacitehtopyh a"
t
;-> "elcitrap cimotabus lanoisnemid-eno lacitehtopyh a"

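    As mentioned before, to use these functions without destroying the
original data, work on a copy made with copy.

(set 't "a hypothetical one-dimensional subatomic particle") ; restore t first
(reverse (copy t))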
;-> "elcitrap cimotabus lanoisnemid-eno lacitehtopyh a"
t
;-> "a hypothetical one-dimensional subatomic particle"

    The first reverse above changed t permanently. The case-conversion
functions below, however, leave the original string untouched.

(set 't "a hypothetical one-dimensional subatomic particle")
(upper-case t)
;-> "A HYPOTHETICAL ONE-DIMENSIONAL SUBATOMIC PARTICLE"
(lower-case t)
;-> "a hypothetical one-dimensional subatomic particle"
(title-case t)
;-> "A hypothetical one-dimensional subatomic particle"
t
;-> "a hypothetical one-dimensional subatomic particle"

IV. Substrings

    To extract part of a string, use the following:

(set 't "a hypothetical one-dimensional subatomic particle")
(first t)
;-> "a"
(rest t)
;-> " hypothetical one-dimensional subatomic particle"
(last t)
;-> "e"
(t 2)
;-> "h"

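    You will notice how much this resembles the list operations from the
previous chapter. In newLISP most list functions also work on strings,
including the various selection functions.

1: String slices

    slice cuts a new string out of an existing one.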

(set 't "a hypothetical one-dimensional subatomic particle")
(slice t 15 13) ; start at index 15 and take 13 characters
;-> "one-dimension"
(slice t -8 8) ; start 8 characters from the end and take 8 characters
;-> "particle"
(slice t 2 -9) ; start at index 2 and stop before the 9th character from the end
;-> "hypothetical one-dimensional subatomic"
(slice "schwarzwalderkirschtorte" 19 -1) ; likewise, the last character is left out
;-> "tort"

    Strings can of course be sliced implicitly as well.

(15 13 t)
;-> "one-dimension"
(0 14 t)
;-> "a hypothetical"
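
    Implicit slicing also accepts a single offset, and the offset may be
negative. The following sketch reuses the same t and shows a few equivalent
forms; nothing here goes beyond the implicit indexing described in the
newLISP manual.

```newlisp
(set 't "a hypothetical one-dimensional subatomic particle")

(println (41 t))      ; particle   (everything from index 41 on)
(println (-8 t))      ; particle   (the last 8 characters)
(println (-8 4 t))    ; part       (4 characters, starting 8 from the end)

; The implicit form and slice give the same result.
(println (= (15 13 t) (slice t 15 13)))   ; true
```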

    The strings extracted above are all contiguous. To pick out scattered characters, use select:

(set 't "a hypothetical one-dimensional subatomic particle")
(select t 3 5 24 48 21 10 44 8)
;-> "yosemite"
(select t (sequence 1 49 12)) ; every 12th character, starting at index 1
;-> " lime"

> (help select)
syntax: (select <string> <list-selection>)
syntax: (select <string> [<int-index_i> ... ])

     <list-selection> is a list holding the positions of the characters to extract.

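2: Changing the ends of strings

    chop and trim perform amputations at the ends of a string. Neither is
destructive: each returns a new string and leaves the original alone.
    Snip, snip, snip...

    chop removes characters from the end of a string...

(chop t) ; the last character by default
;-> "a hypothetical one-dimensional subatomic particl"
(chop t 9) ; chop off the last 9 characters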
;-> "a hypothetical one-dimensional subatomic"

    trim removes the given character from the beginning and the end of a string.

(set 's " centred ")
(trim s) ; defaults to removing spaces
;-> "centred"

(set 's "------centred------")
(trim s "-")
;-> "centred"

(set 's "------centred*******")
(trim s "-" "*") ; the leading and trailing characters can be given separately
;-> "centred"
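
    Because chop and trim return a new string, the result has to be stored if
you want to keep it. A minimal sketch (the sample string is made up for this
illustration):

```newlisp
(set 's "--centred**")

(println (trim s "-" "*"))   ; centred
(println (chop s 2))         ; --centred
(println s)                  ; --centred**   (s itself is unchanged)

; Assign the result back to keep the trimmed version.
(set 's (trim s "-" "*"))
(println s)                  ; centred
```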

3: push and pop work on strings too

    push inserts an element into a string at a given position; pop does the
opposite and removes one.
    If no position is given, both work at the front of the string.

(set 't "some ")
(push "this is " t)
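(push "text " t -1)
;-> t is now "this is some text "

    pop returns the piece it removed, not the target string; in newLISP 10,
push returns the modified string. Both change the string in place, which is
faster when working with big strings. At the console you can wrap a call in
silent to suppress the echoed result.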
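
    Here is a short sketch of taking a string apart with pop and wrapping
what is left with push. The sample string "key=value" and the variable names
are invented for illustration.

```newlisp
(set 'line "key=value")

(set 'key (pop line 0 3))   ; pop hands back the 3 characters it removed
(println key)               ; key
(println line)              ; =value

(pop line)                  ; with no index, pop removes the first character
(println line)              ; value

(push "[" line)             ; insert at the front (the default position)
(push "]" line -1)          ; insert at the end
(println line)              ; [value]
```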

>(help pop)
syntax: (pop <str> [<int-index> [<int-length>]])

    The optional [<int-length>] sets how many characters to pop.

(set 'version-string (string (sys-info -2)))
; eg: version-string is "10402"
(set 'dev-version (pop version-string -2 2)) ; always two digits
; dev-version is "02", version-string is now "104"
(set 'point-version (pop version-string -1)) ; always one digit
; point-version is "4", version-string is now "10"
(set 'version version-string) ; one or two digits (up to 99?)
(println version "." point-version "." dev-version " on " ostype)
10.4.02 on Win32
"Win32"

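    ostype holds the operating system type.

V. Modifying strings

    There are two ways to modify a string: by position, or by content.

1: Using index numbers in strings

    Long ago there were nth-set and set-nth, but with so many ways of setting
and being set, their usage and return values became too complicated, and they
have vanished from current versions. We can still use implicit indexing to
reach the element at a given position and change it with setf.

> (set 'str "thinking newLISP !")
"thinking newLISP !"
> (setf (str 0) "I t")
"I t"
> str
"I thinking newLISP !"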

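    The replacement does not have to be longer than the character it
replaces. As a small sketch along the same lines, this swaps single
characters in place by index:

```newlisp
(set 'str "thinking newLISP !")

(setf (str 0) "T")     ; replace the character at index 0
(println str)          ; Thinking newLISP !

(setf (str 8) "_")     ; index 8 is the space after the first word
(println str)          ; Thinking_newLISP !
```
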
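2: Changing substrings

    Often you do not know the index of the characters you want to change, or
finding it out would cost too much.
    In that case, use replace to substitute every part of the string that
matches what you are looking for...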

> (help replace)
syntax: (replace <str-key> <str-data> <exp-replacement>)
syntax: (replace <str-pattern> <str-data> <exp-replacement> <int-regex-option>)

(replace old-string source-string replacement)
So:
(set 't "a hypothetical one-dimensional subatomic particle")
(replace "hypoth" t "theor") ; replace every occurrence of hypoth with theor
;-> "a theoretical one-dimensional subatomic particle"

replace is destructive. If you do not want to change the original string, use copy or string:

(set 't "a hypothetical one-dimensional subatomic particle")
(replace "hypoth" (string t) "theor")
;-> "a theoretical one-dimensional subatomic particle"
t
;-> "a hypothetical one-dimensional subatomic particle"
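
    The second syntax shown by (help replace) takes a regular expression and
an option number. A minimal sketch, assuming option 0 (no special regex
flags) and a made-up sample string; $0 holds the text of the current match
while the replacement expression is evaluated:

```newlisp
(set 'src "a9b66c4")

; Replace every run of digits with a fixed string.
(println (replace {\d+} (copy src) "#" 0))                  ; a#b#c#

; The replacement is an expression, so it can use the match in $0.
(println (replace {\d+} (copy src) (string "<" $0 ">") 0))  ; a<9>b<66>c<4>

(println src)   ; a9b66c4   (copy protected the original)
```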
