
python - Approaches to merging large text data files

Views: 79 Date: 2022-08-12 15:46:37

Question

Background:

I have three csv files, laid out as follows:

afile: userid, username, ...
bfile: postid, userid, postname, ...
cfile: postid, postnum, ...

afile = 10G
bfile = 150G
cfile = 20G

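Note: the field separator is not a single character (such as a comma) but a string of several special characters. Some fields can contain single-character delimiters; I tried every single character on the keyboard and each one appears somewhere in the data. That is why a multi-character string is used as the separator, so the files are not strictly CSV, which is the most painful part.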
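One way to remove this pain point is to rewrite each file once as standard, quoted CSV, so that ordinary CSV tooling can read it afterwards. The following is a minimal sketch using only the standard library. It assumes `##` stands in for the real separator (as in the pseudocode further down) and that no field contains an embedded newline; `convert_to_csv` is a hypothetical helper name.

```python
import csv
import io

SEP = "##"  # assumption: placeholder for the real multi-character separator


def convert_to_csv(src, dst, sep=SEP):
    """Rewrite sep-delimited lines from src as standard, quoted CSV on dst."""
    writer = csv.writer(dst, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    rows = 0
    for line in src:
        # Strip only the line ending so spaces inside fields are preserved.
        fields = line.rstrip("\r\n").split(sep)
        # The csv writer quotes any field containing a comma or a quote.
        writer.writerow(fields)
        rows += 1
    return rows


# Tiny in-memory demo; real use would pass open file handles instead.
sample = io.StringIO('postid##userid##postname\n1##10##hello, "world"\n')
converted = io.StringIO()
row_count = convert_to_csv(sample, converted)
print(converted.getvalue())
```

Because the input is processed one line at a time, memory use stays flat regardless of file size. When writing to a real file, open it with `newline=""` as the `csv` module documentation recommends.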

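Goal:

I want to merge these three files: join bfile and cfile on the postid column, then join the result with afile on the userid column. The final output should look roughly like postid, userid, postname, postnum, username.

My current pseudocode is as follows: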
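To make the target shape concrete, here is a small illustrative sketch of the two-step merge on made-up rows that fit in memory. Only the column names come from the description above; the sample values, the `##` separator, and the `read` helper are assumptions for illustration.

```python
import io

import pandas as pd

SEP = "##"  # assumption: placeholder for the real multi-character separator

# Made-up sample rows; only the column names come from the question.
afile = io.StringIO("userid##username\n10##alice\n11##bob\n")
bfile = io.StringIO(
    "postid##userid##postname\n"
    "1##10##first post\n"
    "2##11##second post\n"
    "3##12##orphan\n"
)
cfile = io.StringIO("postid##postnum\n1##5\n2##7\n")


def read(f):
    # A multi-character separator requires the slower python engine.
    return pd.read_csv(f, sep=SEP, engine="python")


a, b, c = read(afile), read(bfile), read(cfile)

# Step 1: join bfile and cfile on postid (inner join by default).
bc = pd.merge(b, c, on="postid")
# Step 2: join the intermediate result with afile on userid.
result = pd.merge(bc, a, on="userid")
result = result[["postid", "userid", "postname", "postnum", "username"]]
print(result)
```

Post 3 has no matching row in cfile, so the default inner join drops it. Whether unmatched rows should be kept (for example with `how="left"`) is not stated in the question.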

    import pandas as pd

    chunksize = 1000000  # 1,000,000 rows per chunk; no problems so far
    # Placeholder paths to the three input files
    afile, bfile, cfile = 'afile', 'bfile', 'cfile'

    try:
        resultchunktotal = []
        # Read bfile in chunks; a multi-character sep needs engine='python'
        for bfilechunk in pd.read_csv(bfile, chunksize=chunksize, engine='python', sep='##'):
            # Re-read cfile in chunks for every bfile chunk
            for cfilechunk in pd.read_csv(cfile, chunksize=chunksize, engine='python', sep='##'):
                bfilecfilechunk = pd.merge(bfilechunk, cfilechunk, on='postid')
                # Non-empty means bfile and cfile share some postid values
                if bfilecfilechunk.empty:
                    continue
                # Re-read afile in chunks for every non-empty intermediate result
                for afilechunk in pd.read_csv(afile, chunksize=chunksize, engine='python', sep='##'):
                    # The original passed on='' here; the key is userid
                    chunkresult = pd.merge(bfilecfilechunk, afilechunk, on='userid')
                    # Non-empty means there are userid values in common
                    if not chunkresult.empty:
                        resultchunktotal.append(chunkresult)
        if len(resultchunktotal) > 0:
            pd.concat(resultchunktotal).to_csv('result.csv', index=False)
    except Exception as e:
        print(e)

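But this feels very inefficient, so I would be grateful for better approaches and for good tools or methods.

P.S. Calling this "big data" is a misnomer; the data is merely somewhat large.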

Answers

Answer 1:

Stop writing code. This looks like a job for a one-line shell script: use the xsv join subcommand.
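For readers who would rather stay in Python, one alternative to nested chunk loops is to let an on-disk database perform the join. The following is a rough sketch using the standard-library `sqlite3` module, not something proposed in the answer above. It assumes `##` stands in for the real separator, that each file starts with a header line, and that no field contains a newline; `load_table` and `join_to_csv` are hypothetical helper names, and the sample rows are made up.

```python
import csv
import os
import sqlite3
import tempfile

SEP = "##"  # assumption: placeholder for the real multi-character separator


def load_table(conn, table, path, sep=SEP, batch=100_000):
    """Stream a sep-delimited file with a header line into a SQLite table."""
    with open(path, encoding="utf-8") as f:
        header = f.readline().rstrip("\r\n").split(sep)
        cols = ", ".join(f'"{name}"' for name in header)
        marks = ", ".join("?" for _ in header)
        conn.execute(f'CREATE TABLE "{table}" ({cols})')
        insert = f'INSERT INTO "{table}" VALUES ({marks})'
        rows = []
        for line in f:
            rows.append(line.rstrip("\r\n").split(sep))
            if len(rows) >= batch:  # insert in batches to bound memory use
                conn.executemany(insert, rows)
                rows = []
        if rows:
            conn.executemany(insert, rows)
    conn.commit()


def join_to_csv(conn, out_path):
    """Index the two lookup tables, then stream the three-way join to CSV."""
    conn.execute("CREATE INDEX idx_c_postid ON cfile (postid)")
    conn.execute("CREATE INDEX idx_a_userid ON afile (userid)")
    query = """
        SELECT b.postid, b.userid, b.postname, c.postnum, a.username
        FROM bfile AS b
        JOIN cfile AS c ON c.postid = b.postid
        JOIN afile AS a ON a.userid = b.userid
    """
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        cursor = conn.execute(query)
        writer.writerow([d[0] for d in cursor.description])
        writer.writerows(cursor)  # rows are fetched lazily from the cursor


# Demo on tiny made-up files written to a temporary directory.
workdir = tempfile.mkdtemp()
samples = {
    "afile": "userid##username\n10##alice\n11##bob\n",
    "bfile": "postid##userid##postname\n1##10##first post\n2##11##second, post\n",
    "cfile": "postid##postnum\n1##5\n2##7\n",
}
conn = sqlite3.connect(os.path.join(workdir, "merge.db"))
for table, text in samples.items():
    path = os.path.join(workdir, table + ".txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    load_table(conn, table, path)
result_path = os.path.join(workdir, "result.csv")
join_to_csv(conn, result_path)
conn.close()

with open(result_path, newline="", encoding="utf-8") as f:
    result_rows = list(csv.reader(f))
print(result_rows)
```

Each input file is read exactly once, and the indexes let the database look up matching postid and userid values instead of rescanning cfile and afile for every chunk of bfile. The database file needs disk space on the order of the input size, and load time at this scale has not been measured here.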
