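Requirement:
At work I need to compute a large amount of data such as monetary policy uncertainty. This means extracting the relevant text values from each data source and aggregating them for statistical analysis, so that they can be used later in regressions and other empirical designs.
Solution:
The getline function in Python's built-in linecache module is simple and easy to use: it directly returns the content of a given line of a file, which can then be aggregated and analyzed.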
#!/usr/bin/python
# -*- coding: UTF-8 -*-
# data_filter
import os
import linecache

info = os.getcwd()
fout = open('data_filter.txt_', 'w')

def writeintofile(path):
    need = ''
    for lineno in range(360, 363):  # lines 360 to 362
        need += linecache.getline(path, lineno)  # extract the content of that line
    fout.write(need + path + '\n')

for root, dirs, files in os.walk(info):
    if len(dirs) == 0:  # only directories without subdirectories
        for fl in files:
            path = os.path.join(root, fl)  # portable, unlike a hard-coded "\"
            if path[-3:] == 'txt':  # only process text files
                writeintofile(path)
fout.close()
input('Finished....Write BY Tom \nEnter Exit')
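To see what getline does in isolation, here is a minimal sketch run against a throwaway file. The helper name read_lines and the temporary file are assumptions made for illustration and are not part of the script above. Line numbers are 1-based, and getline returns an empty string instead of raising when the line does not exist.

```python
import linecache
import os
import tempfile

def read_lines(path, start, stop):
    """Return lines start..stop-1 (1-based) of path joined into one string.

    Hypothetical helper, mirroring the range(360, 363) loop in the script.
    """
    return ''.join(linecache.getline(path, n) for n in range(start, stop))

# Build a small throwaway file to demonstrate on.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('first\nsecond\nthird\n')
    demo_path = tmp.name

picked = read_lines(demo_path, 2, 4)         # lines 2 and 3
missing = linecache.getline(demo_path, 99)   # no such line: empty string

linecache.clearcache()   # linecache keeps file contents cached in memory
os.remove(demo_path)
```

Because a missing line comes back as an empty string, a file shorter than 360 lines contributes only its path to the output of the script above, so it may be worth checking for empty results when many files are processed.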
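os.walk does not visit files in order of time (the order comes from the underlying directory listing), but here the content has to be extracted in the order in which the files were generated. I therefore use os.path.getmtime(), which returns a file's last modification time (not its creation time), as a proxy: build a dictionary with the modification time as key and the file path as value, sort the keys, and output the values in that order. The improved version is below; I do not know whether there is a better way: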
#!/usr/bin/python
# -*- coding: UTF-8 -*-
# data_filter
import os
import linecache

info = os.getcwd()
fout = open('data_filter.txt_', 'w')

d = {}  # dictionary: modification time as key, list of file paths as value
for root, dirs, files in os.walk(info):
    for name in files:
        path = os.path.join(root, name)  # getmtime needs the full path
        # a list per key, so files sharing a timestamp are not overwritten
        d.setdefault(os.path.getmtime(path), []).append(path)

def writeintofile(path):
    need = ''
    for lineno in range(360, 363):  # lines 360 to 362
        need += linecache.getline(path, lineno)
    fout.write(need + path + '\n')

for file_time in sorted(d):  # sort by time
    for path in d[file_time]:
        # print(path)  # for test
        if path[-3:] == 'txt':
            writeintofile(path)
fout.close()
input('Finished....Write BY Tom \nEnter Exit')
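On the question of a better way: one possible alternative, offered as a sketch rather than a definitive answer, is to skip the dictionary, collect the paths in a list, and sort the list with os.path.getmtime as the key. The function name txt_files_by_mtime is an assumption introduced here for illustration.

```python
import os

def txt_files_by_mtime(base_dir):
    """Return paths of .txt files under base_dir, oldest modification first.

    Hypothetical helper: collects full paths while walking, then sorts once.
    """
    paths = []
    for root, dirs, files in os.walk(base_dir):
        for name in files:
            if name.endswith('.txt'):
                paths.append(os.path.join(root, name))
    # Sort by modification time; the path itself breaks ties so that
    # files with identical timestamps still come out in a stable order.
    return sorted(paths, key=lambda p: (os.path.getmtime(p), p))
```

Each path returned by txt_files_by_mtime(os.getcwd()) could then be passed to writeintofile in turn. A list avoids the key-collision issue entirely, since no two files compete for the same dictionary slot.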
Data source: monetary policy uncertainty data