# LLM Data Automation for Construction
## Overview

Based on DDC methodology (Chapter 2.3), this skill enables automation of construction data processing using Large Language Models (LLMs). Instead of manually coding data transformations, you describe what you need in natural language, and the LLM generates the necessary Python/Pandas code.

**Book Reference:** "Pandas DataFrame и LLM ChatGPT" / "Pandas DataFrame and LLM ChatGPT"

> "LLM models such as ChatGPT and LLaMA allow specialists without deep programming knowledge to contribute to the automation and improvement of a company's business processes."
> — DDC Book, Chapter 2.3
## Quick Start

### Option 1: Use ChatGPT/Claude Online

Simply describe your data processing task in natural language:

```
Prompt: "Write Python code to read an Excel file with construction materials,
filter rows where quantity > 100, and save to CSV."
```
### Option 2: Run a Local LLM (Ollama)

```bash
# Install Ollama from ollama.com
ollama pull mistral
# Run a query
ollama run mistral "Write Pandas code to calculate total cost from quantity * unit_price"
```
### Option 3: Use LM Studio (GUI)

- Download from lmstudio.ai
- Install and select a model (e.g., Mistral, LLaMA)
- Start chatting with your local AI
## Core Concepts

### DataFrame as a Universal Format

```python
import pandas as pd

# Construction project as a DataFrame
# Rows = elements, Columns = attributes
df = pd.DataFrame({
    'element_id': ['W001', 'W002', 'C001'],
    'category': ['Wall', 'Wall', 'Column'],
    'material': ['Concrete', 'Brick', 'Steel'],
    'volume_m3': [45.5, 32.0, 8.2],
    'cost_per_m3': [150, 80, 450]
})

# Calculate total cost
df['total_cost'] = df['volume_m3'] * df['cost_per_m3']
print(df)
```
## LLM Prompts for Construction Tasks

**Data Import:**

```
"Write code to import an Excel file with a construction schedule,
parse dates, and create a Pandas DataFrame"
```

**Data Filtering:**

```
"Filter construction elements where category is 'Structural'
and cost exceeds the budget limit of 50000"
```

**Data Aggregation:**

```
"Group construction data by floor level,
calculate total volume and cost for each floor"
```

**Report Generation:**

```
"Create a summary report with material quantities grouped by category,
export to Excel with formatting"
```
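As a concrete illustration, the "Data Aggregation" prompt typically yields code along these lines. The sample data and the column names `level`, `volume_m3`, and `cost` are assumptions made for the sketch, not part of any specific export format:

```python
import pandas as pd

# Hypothetical per-element data for the sketch
df = pd.DataFrame({
    'level': ['L1', 'L1', 'L2'],
    'volume_m3': [10.0, 5.0, 8.0],
    'cost': [1500, 400, 3600]
})

# Group by floor level and total volume and cost per floor
per_floor = df.groupby('level')[['volume_m3', 'cost']].sum()
print(per_floor)
```

Running the prompt against your own column names is usually just a matter of stating them explicitly in the prompt text.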
## Common Use Cases

### 1. Extract Data from PDF Documents

```python
# Prompt to ChatGPT:
# "Write code to extract tables from PDF and convert to DataFrame"
import pdfplumber
import pandas as pd

def pdf_to_dataframe(pdf_path):
    """Extract tables from a PDF file."""
    all_tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            tables = page.extract_tables()
            for table in tables:
                if table:
                    df = pd.DataFrame(table[1:], columns=table[0])
                    all_tables.append(df)
    if all_tables:
        return pd.concat(all_tables, ignore_index=True)
    return pd.DataFrame()

# Usage
df = pdf_to_dataframe("construction_spec.pdf")
df.to_excel("extracted_data.xlsx", index=False)
```
### 2. Process BIM Element Data

```python
# Prompt: "Analyze BIM elements, group by category, calculate volumes"
import pandas as pd

def analyze_bim_elements(csv_path):
    """Analyze BIM element data from a CSV export."""
    df = pd.read_csv(csv_path)
    # Group by category and summarize
    summary = df.groupby('Category').agg({
        'Volume': 'sum',
        'Area': 'sum',
        'ElementId': 'count'
    }).rename(columns={'ElementId': 'Count'})
    return summary

# Usage
summary = analyze_bim_elements("revit_export.csv")
print(summary)
```
### 3. Cost Estimation Pipeline

```python
# Prompt: "Create cost estimation from quantities and unit prices"
import pandas as pd

def calculate_cost_estimate(quantities_df, prices_df):
    """
    Calculate a project cost estimate.

    Args:
        quantities_df: DataFrame with columns [item_code, quantity]
        prices_df: DataFrame with columns [item_code, unit_price, unit]

    Returns:
        DataFrame with cost calculations
    """
    # Merge quantities with prices
    result = quantities_df.merge(prices_df, on='item_code', how='left')
    # Calculate costs
    result['total_cost'] = result['quantity'] * result['unit_price']
    # Add each item's share of the total cost
    result['cost_percentage'] = (result['total_cost'] /
                                 result['total_cost'].sum() * 100).round(2)
    return result

# Usage
quantities = pd.DataFrame({
    'item_code': ['C001', 'S001', 'W001'],
    'quantity': [150, 2000, 500]
})
prices = pd.DataFrame({
    'item_code': ['C001', 'S001', 'W001'],
    'unit_price': [120, 45, 85],
    'unit': ['m3', 'kg', 'm2']
})
estimate = calculate_cost_estimate(quantities, prices)
print(estimate)
```
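One pitfall worth noting with a `how='left'` merge like the one above: items without a matching price row come through with `NaN` unit prices and silently produce `NaN` total costs. A small check after the merge catches them. The two frames below are hypothetical inputs constructed for the sketch:

```python
import pandas as pd

# Hypothetical inputs: item code 'X999' has no price entry
quantities = pd.DataFrame({'item_code': ['C001', 'X999'], 'quantity': [150, 10]})
prices = pd.DataFrame({'item_code': ['C001'], 'unit_price': [120]})

result = quantities.merge(prices, on='item_code', how='left')

# Rows with no matching price show up as NaN after a left merge
unpriced = result[result['unit_price'].isna()]
print("Items missing a price:", unpriced['item_code'].tolist())
```

Flagging these rows before summing keeps a missing price from being mistaken for a zero cost.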
### 4. Schedule Data Processing

```python
# Prompt: "Parse construction schedule, calculate durations, identify delays"
import pandas as pd

def analyze_schedule(schedule_path):
    """Analyze a construction schedule for delays."""
    df = pd.read_excel(schedule_path)
    # Parse dates
    df['start_date'] = pd.to_datetime(df['start_date'])
    df['end_date'] = pd.to_datetime(df['end_date'])
    df['actual_end'] = pd.to_datetime(df['actual_end'])
    # Calculate durations
    df['planned_duration'] = (df['end_date'] - df['start_date']).dt.days
    df['actual_duration'] = (df['actual_end'] - df['start_date']).dt.days
    # Identify delays
    df['delay_days'] = df['actual_duration'] - df['planned_duration']
    df['is_delayed'] = df['delay_days'] > 0
    return df

# Usage
schedule = analyze_schedule("project_schedule.xlsx")
delayed_tasks = schedule[schedule['is_delayed']]
print(f"Delayed tasks: {len(delayed_tasks)}")
```
## Local LLM Setup (No Internet Required)

### Using Ollama

```bash
# Install
curl -fsSL https://ollama.com/install.sh | sh

# Download models
ollama pull mistral          # General purpose, 7B parameters
ollama pull codellama        # Code-focused
ollama pull deepseek-coder   # Strong at coding tasks

# Run
ollama run mistral "Write Pandas code to merge two DataFrames on project_id"
```
### Using LlamaIndex for Company Documents

```python
# Load company documents into a local, searchable index
# (recent versions of llama-index import these from llama_index.core instead)
from llama_index import SimpleDirectoryReader, VectorStoreIndex

# Read all PDFs from a folder
reader = SimpleDirectoryReader("company_documents/")
documents = reader.load_data()

# Create a searchable index
index = VectorStoreIndex.from_documents(documents)

# Query your documents
query_engine = index.as_query_engine()
response = query_engine.query(
    "What are the standard concrete mix specifications?"
)
print(response)
```
## IDE Recommendations
| IDE | Best For | Features |
|-----|----------|----------|
| Jupyter Notebook | Learning, experiments | Interactive cells, visualizations |
| Google Colab | Free GPU, quick start | Cloud-based, pre-installed libs |
| VS Code | Professional development | Extensions, GitHub Copilot |
| PyCharm | Large projects | Advanced debugging, refactoring |
### Quick Setup with Jupyter

```bash
pip install jupyter pandas openpyxl pdfplumber
jupyter notebook
```
## Best Practices

- **Start Simple:** Begin with clear, specific prompts
- **Iterate:** Refine prompts based on results
- **Validate:** Always check generated code before running
- **Document:** Save working prompts for reuse
- **Secure:** Use a local LLM for sensitive company data
## Common Prompts Library

### Data Import
- "Read Excel file and show first 10 rows"
- "Import CSV with custom delimiter and encoding"
- "Load multiple Excel sheets into dictionary of DataFrames"
### Data Cleaning
- "Remove duplicate rows based on element_id"
- "Fill missing values with column mean"
- "Convert column to numeric, handling errors"
### Data Analysis
- "Calculate descriptive statistics for numeric columns"
- "Find correlation between cost and duration"
- "Identify outliers using IQR method"
### Data Export
- "Export to Excel with multiple sheets"
- "Save to CSV with specific encoding"
- "Generate formatted PDF report"
## Resources
- Book: "Data-Driven Construction" by Artem Boiko, Chapter 2.3
- Website: https://datadrivenconstruction.io
- Pandas Documentation: https://pandas.pydata.org/docs/
- Ollama: https://ollama.com
- LM Studio: https://lmstudio.ai
- Google Colab: https://colab.research.google.com
## Next Steps

- See `pandas-construction-analysis` for advanced Pandas operations
- See `pdf-to-structured` for document processing
- See `etl-pipeline` for automated data pipelines
- See `rag-construction` for RAG implementation with construction documents