Reading Open-Source Projects, Part 1: Some Simple Python Syntax and Methods

Posted by ainatipen on 2022-1-11 13:48

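I have been reading the source code of some open-source Python bioinformatics projects, and these notes record the key syntax and packages they use. Syntax does not need to be memorized, but it is still worth writing down, so some very basic things are noted here as well.
The concept of classes in Python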

Python object-oriented programming | Runoob (runoob.com)
    #!/usr/bin/python
    # -*- coding: UTF-8 -*-

    class Employee:
        'Base class for all employees'
        empCount = 0

        def __init__(self, name, salary):
            self.name = name
            self.salary = salary
            Employee.empCount += 1

        def displayCount(self):
            print("Total Employee %d" % Employee.empCount)

        def displayEmployee(self):
            print("Name : ", self.name, ", Salary: ", self.salary)

    # Create the first object of the Employee class
    emp1 = Employee("Zara", 2000)
    # Create the second object of the Employee class
    emp2 = Employee("Manni", 5000)
    emp1.displayEmployee()
    emp2.displayEmployee()
    print("Total Employee %d" % Employee.empCount)
    # Name :  Zara , Salary:  2000
    # Name :  Manni , Salary:  5000
    # Total Employee 2
enumerate() function

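The enumerate() function wraps an iterable (such as a list, tuple, or string) into a sequence of (index, item) pairs, yielding each item together with its index. It is mostly used in for loops.
    seasons = ['Spring', 'Summer', 'Fall', 'Winter']
    list(enumerate(seasons))
    # [(0, 'Spring'), (1, 'Summer'), (2, 'Fall'), (3, 'Winter')]
    list(enumerate(seasons, start=1))    # indices start from 1
    # [(1, 'Spring'), (2, 'Summer'), (3, 'Fall'), (4, 'Winter')]

    # The manual-counter loop that enumerate() replaces
    i = 0
    seq = ['one', 'two', 'three']
    for element in seq:
        print(i, seq[i])
        i += 1
    # 0 one
    # 1 two
    # 2 three
zip() function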

    a = [1, 2, 3]
    b = [4, 5, 6]
    zipped = list(zip(a, b))    # pack into a list of tuples (in Python 3, zip() alone returns an iterator)
    # [(1, 4), (2, 5), (3, 6)]
itertools.combinations

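itertools --- Functions creating iterators for efficient looping — Python 3.10.1 documentation
In Python, combinations() generates all combinations of a given length from a sequence (order does not matter; for ordered arrangements, itertools also offers permutations()).
    from itertools import combinations
    test_data = ['a1', 'a2', 'a3', 'b']
    for i in combinations(test_data, 2):
        print(i)
    # ('a1', 'a2')
    # ('a1', 'a3')
    # ('a1', 'b')
    # ('a2', 'a3')
    # ('a2', 'b')
    # ('a3', 'b')

    # To see only the comparisons of a1 against the others
    z = 0
    for i in combinations(test_data, 2):
        if z < (len(test_data) - 1):
            print(i)
        z += 1
    # ('a1', 'a2')
    # ('a1', 'a3')
    # ('a1', 'b')
mappy package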

This is the Python binding of minimap2

lh3/minimap2: A versatile pairwise aligner for genomic and spliced nucleotide sequences (github.com)

This class describes an alignment. An object of this class has the following properties:
ctg: name of the reference sequence the query is mapped to
ctg_len: total length of the reference sequence
r_st and r_en: start and end positions on the reference
q_st and q_en: start and end positions on the query
strand: +1 if on the forward strand; -1 if on the reverse strand
mapq: mapping quality
blen: length of the alignment, including both alignment matches and gaps but excluding ambiguous bases.
mlen: length of the matching bases in the alignment, excluding ambiguous base matches.
NM: number of mismatches, gaps and ambiguous positions in the alignment
trans_strand: transcript strand. +1 if on the forward strand; -1 if on the reverse strand; 0 if unknown
is_primary: if the alignment is primary (typically the best and the first to generate)
read_num: read number that the alignment corresponds to; 1 for the first read and 2 for the second read
cigar_str: CIGAR string
cigar: CIGAR returned as an array of shape (n_cigar,2). The two numbers give the length and the operator of each CIGAR operation.
MD: the MD tag as in the SAM format. It is an empty string unless the MD argument is applied when calling mappy.Aligner.map().
cs: the cs tag.
mappy · PyPI
    pip install --user mappy

    import mappy as mp
    a = mp.Aligner("test/MT-human.fa")  # load or build index
    if not a: raise Exception("ERROR: failed to load/build index")
    s = a.seq("MT_human", 100, 200)     # retrieve a subsequence from the index
    print(mp.revcomp(s))                # reverse complement
    for name, seq, qual in mp.fastx_read("test/MT-orang.fa"): # read a fasta/q sequence
        for hit in a.map(seq): # traverse alignments
            print("{}\t{}\t{}\t{}".format(hit.ctg, hit.r_st, hit.r_en, hit.cigar_str))
google.protobuf

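Google protobuf is an excellent open-source tool. In a project it can serve as the interface for exchanging data between services, for example in RPC services or data file transfer. protobuf provides standard serialization and deserialization methods for the objects defined in a proto file, which makes it easy to parse and convert pb objects.
Introduction to and use of Protobuf - Jianshu (jianshu.com)
Tips and experience with Google protobuf - 张巩武 - Cnblogs (cnblogs.com)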
hashlib

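Python's hashlib provides common digest algorithms such as MD5 and SHA1.
What is a digest algorithm? Also known as a hash algorithm, it uses a function to convert data of arbitrary length into a fixed-length string (usually written in hexadecimal).
A digest algorithm can reveal whether data has been tampered with because the digest function is one-way: computing f(data) is easy, but recovering data from the digest is very hard. Moreover, changing even a single bit of the original data produces a completely different digest.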
hashlib - Liao Xuefeng's official website (liaoxuefeng.com)
    import hashlib
    md5 = hashlib.md5()
    md5.update('how to use md5 in python hashlib?'.encode('utf-8'))
    print(md5.hexdigest())
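MD5 produces a 128-bit result, usually written as a 32-character hexadecimal string.
SHA1 produces a 160-bit result, usually written as a 40-character hexadecimal string.
More secure than SHA1 are SHA256 and SHA512, but the more secure the algorithm, the slower it is and the longer its digest.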
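As a quick check of the digest lengths listed above, and of the claim that a one-bit change in the input gives a completely different digest, here is a minimal sketch that uses only hashlib. The sample byte strings are made up for illustration.

```python
import hashlib

# Digest length in hex characters for several algorithms
message = b"how to use hashlib in python?"  # arbitrary sample input
hex_lengths = {
    name: len(hashlib.new(name, message).hexdigest())
    for name in ("md5", "sha1", "sha256", "sha512")
}
print(hex_lengths)  # {'md5': 32, 'sha1': 40, 'sha256': 64, 'sha512': 128}

# 'd' and 'e' differ in a single bit, yet the digests look unrelated
d1 = hashlib.sha256(b"hello world").hexdigest()
d2 = hashlib.sha256(b"hello worle").hexdigest()
differing = sum(c1 != c2 for c1, c2 in zip(d1, d2))
print(differing, "of", len(d1), "hex characters differ")
```

Each hex character encodes 4 bits, which is why a 128-bit MD5 digest prints as 32 characters.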
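    import hashlib
    sha1 = hashlib.sha1()
    sha1.update('how to use sha1 in '.encode('utf-8'))
    sha1.update('python hashlib?'.encode('utf-8'))
    print(sha1.hexdigest())
This module implements a common interface to many different secure hash and message digest algorithms. Included are the FIPS secure hash algorithms SHA1, SHA224, SHA256, SHA384, and SHA512 (defined in FIPS 180-2) as well as RSA's MD5 algorithm (defined in Internet RFC 1321). The terms "secure hash" and "message digest" are interchangeable. Older algorithms were called message digests. The modern term is secure hash.
hashlib --- Secure hashes and message digests — Python 3.10.1 documentation
    import hashlib
    m = hashlib.sha256()
    m.update(b"Nobody inspects")
    m.update(b" the spammish repetition")
    m.digest()
    m.digest_size    # 32
    m.block_size     # 64
os.path.expanduser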

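Note: this expands a leading ~ into the user's home directory. It does not turn an arbitrary relative path into an absolute one, but it avoids hard-coding the home directory, so code moves easily between users.

os.path --- Common pathname manipulations — Python 3.10.1 documentation
On Unix and Windows, return the argument with an initial component of ~ or ~user replaced by that user's home directory (~ alone means the current user).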
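To make the distinction concrete, the sketch below contrasts os.path.expanduser with os.path.abspath. The path data/reads.fa is a hypothetical example, not something from the projects being read.

```python
import os

# A leading "~" is replaced by the current user's home directory
expanded = os.path.expanduser("~/Project")
print(expanded)

# A path without a leading "~" comes back unchanged; it is NOT made absolute
unchanged = os.path.expanduser("data/reads.fa")  # hypothetical relative path
print(unchanged)  # data/reads.fa

# Turning a relative path into an absolute one is the job of os.path.abspath
absolute = os.path.abspath("data/reads.fa")
print(os.path.isabs(absolute))  # True
```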
    import os
    path = os.path.expanduser('~/Project')
    path
    # /home/username/Project
math.inf constant

Python math.inf constant - CJavaPy
    # Import math Library
    import math
    # Print positive infinity
    print(math.inf)
    # Print negative infinity
    print(-math.inf)
The difference between array and list in Python

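The difference between array and list in Python - Zhihu (zhihu.com)

In fact, a Python array (here presumably NumPy's ndarray; the standard-library array module is one-dimensional) is very similar to an array in R: it holds multi-dimensional data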
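For a dependency-free illustration of one difference, the sketch below uses the standard-library array module rather than NumPy. That choice is an assumption, since the note does not say which kind of array is meant. It shows that a list accepts mixed element types while an array holds a single type.

```python
from array import array

# A list can mix element types freely
mixed = [1, "two", 3.0]

# array.array stores values of a single C type ('i' = signed int)
ints = array("i", [1, 2, 3])
ints.append(4)

rejected = False
try:
    ints.append("five")  # wrong type for an 'i' array
except TypeError:
    rejected = True

print(ints.tolist())  # [1, 2, 3, 4]
print(rejected)       # True
```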
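collections --- Container datatypes

OrderedDict
A dict subclass that remembers the order in which entries were added
defaultdict
A dict subclass that calls a factory function to supply a default value for missing keys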
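A minimal sketch of both classes follows. The contig and read names are made-up sample data.

```python
from collections import OrderedDict, defaultdict

# defaultdict: list is the factory, so a missing key starts out as []
reads_by_contig = defaultdict(list)
for contig, read in [("chr1", "r1"), ("chr2", "r2"), ("chr1", "r3")]:
    reads_by_contig[contig].append(read)
print(dict(reads_by_contig))  # {'chr1': ['r1', 'r3'], 'chr2': ['r2']}

# OrderedDict: remembers insertion order and can move keys around
od = OrderedDict([("a", 1), ("b", 2), ("c", 3)])
od.move_to_end("a")
print(list(od))  # ['b', 'c', 'a']
```

Without defaultdict, the loop would need an explicit check for whether the key already exists before appending.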


Counter

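A bit like table in R

A Counter is a dict subclass for counting hashable objects. It is a collection where elements are stored as dictionary keys and their counts are stored as values. Counts may be any integer, including zero and negative values. The Counter class is similar to bags or multisets in other languages.
    from collections import Counter

    # Tally occurrences of words in a list
    cnt = Counter()
    for word in ['red', 'blue', 'red', 'green', 'blue', 'blue']:
        cnt[word] += 1
    print(cnt)
    # Counter({'blue': 3, 'red': 2, 'green': 1})

    # Find the ten most common words in Hamlet
    import re
    words = re.findall(r'\w+', open('hamlet.txt').read().lower())
    Counter(words).most_common(10)
logging --- Logging facility for Python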

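Loggers have the following attributes and methods. Note that loggers should never be instantiated directly, but always through the module-level function logging.getLogger(name). Multiple calls to getLogger() with the same name always return a reference to the same Logger object.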
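The "same name, same object" behavior is easy to confirm. In this sketch the logger name myproject.align is a hypothetical example.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(name)s %(levelname)s %(message)s")

logger_a = logging.getLogger("myproject.align")  # hypothetical logger name
logger_b = logging.getLogger("myproject.align")
same_object = logger_a is logger_b
print(same_object)  # True

logger_a.info("index loaded")  # message text is illustrative
```

Because of this, every module can call logging.getLogger(__name__) on its own without passing logger objects around.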
    import logging
    logger = logging.getLogger(__name__)
functools --- Higher-order functions and operations on callable objects

@functools.cache(user_function)
    from functools import cache

    @cache
    def factorial(n):
        return n * factorial(n-1) if n else 1

    >>> factorial(10)      # no previously cached result, makes 11 recursive calls
    3628800
    >>> factorial(5)       # just looks up cached value result
    120
    >>> factorial(12)      # makes two new recursive calls, the other 10 are cached
    479001600
pickle --- Python object serialization

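Much the same as saving objects in R

To serialize an object hierarchy, simply call the dumps() function. Likewise, to deserialize a data stream, call the loads() function. For more control over serialization and deserialization, create a Pickler or an Unpickler object, respectively.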
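A minimal round trip with dumps() and loads() might look like the following. The record is made-up sample data.

```python
import pickle

record = {"sample": "S1", "counts": [3, 1, 4], "nested": {"ok": True}}

blob = pickle.dumps(record)    # object -> bytes
restored = pickle.loads(blob)  # bytes -> a new, equal object
print(type(blob).__name__)     # bytes
print(restored == record, restored is record)  # True False
```

To write to a file instead, pickle.dump() and pickle.load() take an open binary file object. Only unpickle data from a source you trust, since loading a pickle can execute arbitrary code.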
Python data storage: how to use the pickle module - coffee_cream's blog - CSDN blog
tqdm

Adds a progress bar
tqdm/tqdm: A Fast, Extensible Progress Bar for Python and CLI (github.com)
    seq 9999999 | tqdm --bytes | wc -l
glob() function

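glob is a file-handling module that ships with Python. It finds files matching a pattern, much like file search on Windows, and supports three wildcards: *, ?, and []. * matches zero or more characters, ? matches a single character, and [] matches characters in a given range, such as digits. Its two main methods are described below.
What Python's glob() function does and how to use it - xjp_xujiping's blog - CSDN blog
The glob method

In effect a simplified form of pattern matching (shell-style wildcards, not full regular expressions)

The main method of the glob module is glob. It returns a list of all matching file paths and takes one argument, the path pattern to match (absolute or relative). The result covers only files in the directory named in the pattern, not files in its subfolders, unless ** is used together with recursive=True.
    import glob
    import os

    glob.glob(r'c:*.txt')
    # root and basename are defined elsewhere in the calling code
    filenames = glob.glob(os.path.join(root, "**", basename), recursive=True)
The iglob method: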

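Returns an iterator that yields the matching file paths one at a time. The difference from glob.glob() is that glob.glob collects all matching paths at once, whereas glob.iglob produces only one matching path at a time.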
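The behaviors described above can be tried safely in a throwaway directory. This is a sketch and the file names are invented for the demonstration.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Build a small tree: two files at the top level, one in a subfolder
    os.mkdir(os.path.join(root, "sub"))
    for rel in ("a.txt", "b.txt", os.path.join("sub", "c.txt")):
        with open(os.path.join(root, rel), "w") as fh:
            fh.write("x")

    # glob.glob returns a list; "*" does not descend into subfolders
    top_level = sorted(
        os.path.basename(p) for p in glob.glob(os.path.join(root, "*.txt"))
    )

    # "**" with recursive=True also matches files in subfolders
    everything = sorted(
        os.path.relpath(p, root)
        for p in glob.glob(os.path.join(root, "**", "*.txt"), recursive=True)
    )

    # glob.iglob yields the same matches lazily, one path at a time
    lazy = glob.iglob(os.path.join(root, "*.txt"))
    first = os.path.basename(next(lazy))

print(top_level)        # ['a.txt', 'b.txt']
print(len(everything))  # 3
```

The results are sorted here because glob does not guarantee any particular order.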
    f = glob.iglob(r'../*.py')
    print(f)
    # <generator object iglob at 0x00B9FF80>  (exact repr varies by Python version)
    for py in f:
        print(py)
pytest

A Python tool for quickly testing Python scripts; it reports clearly whether each test passed or failed
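As a minimal sketch of what a pytest test file looks like: pytest collects files named test_*.py and runs the functions in them whose names start with test_. The file name and the toy revcomp function below are hypothetical, written only to have something to test.

```python
# test_revcomp.py  (hypothetical file name)

def revcomp(seq):
    """Reverse-complement a DNA string (toy function to have something to test)."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(pairs[base] for base in reversed(seq))


def test_revcomp_simple():
    assert revcomp("AACG") == "CGTT"


def test_revcomp_roundtrip():
    assert revcomp(revcomp("GATTACA")) == "GATTACA"
```

Running pytest test_revcomp.py in a shell reports each test as passed or failed. A plain assert is all that is needed, and on failure pytest shows the values involved.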

Quick start — learning-pytest 1.0 documentation

Python Pytest tutorial | Geek-docs (geek-docs.com)
Pytest user manual — learning-pytest 1.0 documentation
pyro: a probabilistic programming language (PPL) based on PyTorch

pyro-ppl/pyro: Deep universal probabilistic programming with Python and PyTorch (github.com)
Reference documentation

Pyro Documentation — Pyro documentation
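Introduction to inference in Pyro — Pyro Tutorials, a Chinese translation of the official Pyro tutorials

In addition to pyro.condition for incorporating observed data, Pyro also contains pyro.do, an implementation of Pearl's do-operator used for causal inference, with an interface identical to pyro.condition. condition and do can be mixed and composed freely, making Pyro a powerful tool for model-based causal inference.

Pyro: from getting started to walking away - Zhihu (zhihu.com)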
isinstance

Python isinstance() function | Runoob (runoob.com)
    assert isinstance(counts, dict)
typing --- Support for type hints

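What the typing module is for:
Type checking: static checkers and IDEs can flag parameter and return values whose types do not match, before the code runs.
Documentation: the annotations tell callers which types to pass in and what comes back.
Adding the module does not affect how the program runs; no real errors are raised, only warnings from the checking tools.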
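The points above can be seen in a short sketch. The function names count_bases and total are made up for illustration.

```python
from typing import Dict, List


def count_bases(seq: str) -> Dict[str, int]:
    """Count how often each character occurs in seq."""
    counts: Dict[str, int] = {}
    for base in seq:
        counts[base] = counts.get(base, 0) + 1
    return counts


def total(values: List[int]) -> int:
    return sum(values)


print(count_bases("GATTACA"))  # {'G': 1, 'A': 3, 'T': 2, 'C': 1}

# Hints are not enforced at run time: passing floats still works here
unchecked = total([1.5, 2.5])
print(unchecked)  # 4.0
```

A static checker such as mypy would flag the total([1.5, 2.5]) call, while the interpreter itself runs it without complaint.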

Python module: typing - 1024搜, a search engine for programmers (1024sou.com)

typing: Python's library for type annotations - lynskylate - Cnblogs (cnblogs.com)

assert (assertions)
    >>> assert True     # condition is true, runs normally
    >>> assert False    # condition is false, raises an exception
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AssertionError
    >>> assert 1==1     # condition is true, runs normally
    >>> assert 1==2     # condition is false, raises an exception
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AssertionError
    >>> assert 1==2, '1 does not equal 2'
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AssertionError: 1 does not equal 2
    >>>

    import sys
    assert ('linux' in sys.platform), "This code can only run on Linux"
Exception handling

Python3 errors and exceptions | Runoob (runoob.com)
    import sys

    try:
        f = open('myfile.txt')
        s = f.readline()
        i = int(s.strip())
    except OSError as err:
        print("OS error: {0}".format(err))
    except ValueError:
        print("Could not convert data to an integer.")
    except:
        print("Unexpected error:", sys.exc_info()[0])
        raise

    for arg in sys.argv[1:]:
        try:
            f = open(arg, 'r')
        except IOError:
            print('cannot open', arg)
        else:
            print(arg, 'has', len(f.readlines()), 'lines')
            f.close()
@property

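In effect it makes a class attribute private, so that others cannot arbitrarily change the class's original parameters

What is @property? Use cases and usage | Max营销志 (maxlist.xyz)
Feature 1: turning a class method into a read-only attribute

    class Bank_acount:
        @property
        def password(self):
            return 'password: 123'
First we instantiate the class with andy = Bank_acount(). print(andy.password) gives password: 123, but trying to modify andy.password raises AttributeError: can't set attribute. This is the read-only behavior of property.
    andy = Bank_acount()
    print(andy.password)
    >>> password: 123
    andy.password = 'password: 456'
    >>> AttributeError: can't set attribute
It is read-only, so how can it be modified?

Next, in feature 2, we look at property's setter, getter, and deleter methods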
Property feature 2:

    class Bank_acount:
        def __init__(self):
            self._password = 'default password 0000'

        @property
        def password(self):
            return self._password

        @password.setter
        def password(self, value):
            self._password = value

        @password.deleter
        def password(self):
            del self._password
            print('del complite')
getter
    andy = Bank_acount()
    print(andy.password)
    >>> default password 0000
setter
    andy.password = '1234'
    print(andy.password)
    >>> 1234
deleter
    del andy.password
    >>> del complite
    print(andy.password)    # raises AttributeError, because _password has been deleted
Why is @property needed?

@property is a way to implement encapsulation in object-oriented design
Using @property - Liao Xuefeng's official website (liaoxuefeng.com)
    class Student(object):
        def get_score(self):
            return self._score

        def set_score(self, value):
            if not isinstance(value, int):
                raise ValueError('score must be an integer!')
            if value < 0 or value > 100:
                raise ValueError('score must between 0 ~ 100!')
            self._score = value
Is there a way to both validate the argument and access the class's variable as simply as an attribute? For a Python programmer who strives for perfection, this is a must!
Remember that a decorator can add functionality to a function dynamically? Decorators work for class methods as well. Python's built-in @property decorator is what turns a method into an attribute access
    class Student(object):
        @property
        def score(self):
            return self._score

        @score.setter
        def score(self, value):
            if not isinstance(value, int):
                raise ValueError('score must be an integer!')
            if value < 0 or value > 100:
                raise ValueError('score must between 0 ~ 100!')
            self._score = value

    >>> s = Student()
    >>> s.score = 60    # OK, actually converted to s.set_score(60)
    >>> s.score         # OK, actually converted to s.get_score()
    60
    >>> s.score = 9999
    Traceback (most recent call last):
    ...
    ValueError: score must between 0 ~ 100!
This article explains it most clearly
Usage and meaning of Python's @property - 昨天丶今天丶明天的's blog - CSDN blog
base64

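Base64 is a way of representing arbitrary binary data using 64 characters.
The length of a Base64-encoded string is always a multiple of 4 (with the standard = padding)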
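The multiple-of-4 property is easy to verify with the standard-library base64 module. The inputs in this sketch are arbitrary sample bytes.

```python
import base64

for raw in (b"a", b"ab", b"abc", b"abcd"):
    encoded = base64.b64encode(raw)
    print(raw, encoded, len(encoded))
# b'a' b'YQ==' 4
# b'ab' b'YWI=' 4
# b'abc' b'YWJj' 4
# b'abcd' b'YWJjZA==' 8

# Every 3 input bytes become 4 output characters; "=" pads the last group
lengths = [len(base64.b64encode(bytes(n))) for n in range(1, 20)]
all_multiple_of_four = all(length % 4 == 0 for length in lengths)
print(all_multiple_of_four)  # True

# Decoding restores the original bytes, including non-text values
roundtrip = base64.b64decode(base64.b64encode(b"binary \x00\xff data"))
```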
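shutil --- High-level file operations

    #!/usr/bin/env python
    # _*_ coding:utf-8 _*_
    __author__ = 'junxi'
    import shutil

    # Copy the contents of one file into another
    with open('old.txt', 'r') as src, open('new.txt', 'w') as dst:
        shutil.copyfileobj(src, dst)
    # Copy a file
    shutil.copyfile('old.txt', 'old1.txt')
    # Copy permission bits only; contents, group, and owner are unchanged
    shutil.copymode('old.txt', 'old1.txt')
    # Copy permission bits, last access time, and last modification time
    shutil.copystat('old.txt', 'old1.txt')
    # Copy a file to a file or into a directory
    shutil.copy('old.txt', 'old2.txt')
    # Like copy, but also copies the last access and modification times
    shutil.copy2('old.txt', 'old2.txt')
    # Copy olddir to newdir. If the third argument is True, symbolic links in the folder are preserved;
    # if it is False, physical copies are created in the new directory in place of the symbolic links
    shutil.copytree('C:/Users/xiaoxinsoso/Desktop/aaa', 'C:/Users/xiaoxinsoso/Desktop/bbb')
    # Move a directory or file
    shutil.move('C:/Users/xiaoxinsoso/Desktop/aaa', 'C:/Users/xiaoxinsoso/Desktop/bbb')  # move the aaa directory into bbb
    # Delete a directory
    shutil.rmtree('C:/Users/xiaoxinsoso/Desktop/bbb')  # delete the bbb directory
subprocess --- Subprocess management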

Python modules: subprocess usage explained with examples - Cloud+ Community - Tencent Cloud (tencent.com)
    >>> import subprocess
    # When Python parses the command, pass a list containing each argument of the command
    >>> subprocess.run(["df","-h"])
    Filesystem                      Size  Used Avail Use% Mounted on
    /dev/mapper/VolGroup-LogVol00   289G   70G  204G  26% /
    tmpfs                            64G     0   64G   0% /dev/shm
    /dev/sda1                       283M   27M  241M  11% /boot
    CompletedProcess(args=['df', '-h'], returncode=0)
    # When the Linux shell should parse it instead, pass the command as a string with shell=True
    >>> subprocess.run("df -h|grep /dev/sda1",shell=True)
    /dev/sda1                       283M   27M  241M  11% /boot
    CompletedProcess(args='df -h|grep /dev/sda1', returncode=0)
Why is there an arrow after a Python function name?

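7.3 Attaching Informational Metadata to Function Arguments — python3-cookbook 3.0.0 documentation

The arrow (->) annotates the type of the function's return value, just as the annotations on the parameters indicate their types. These are only hints that make the code easier for programmers to read.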
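A small sketch of an annotated function follows. The gc_content function and its parameters are made up for illustration.

```python
def gc_content(seq: str, round_to: int = 2) -> float:
    """Return the fraction of G and C characters in seq."""
    gc = sum(1 for base in seq.upper() if base in "GC")
    return round(gc / len(seq), round_to)


print(gc_content("GATTACA"))  # 0.29

# The annotations are stored on the function and can be inspected
print(gc_content.__annotations__)
# {'seq': <class 'str'>, 'round_to': <class 'int'>, 'return': <class 'float'>}
```

The part after -> ends up under the 'return' key. Python itself does not check it when the function is called.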
