Optimizing Joins in a Map-Reduce Environment


conda Syntax

conda syntax refers to the command syntax used with the conda package manager. conda is an open-source tool for package management and environment management that helps users easily install and manage Python packages and other software. This article introduces some commonly used conda syntax and explains its usage in detail.

II. Creating and activating environments
1. Create an environment:
conda create -n <env_name> [<package>=<version>]
2. Activate an environment
After creating an environment, you must activate it before you can use the packages in it. Use the conda activate command:
conda activate <env_name>
You can use the conda deactivate command to switch back to the default environment.
3. Update a package
Use the conda update command to update a specified package:
conda update <package>

III. Finding and removing packages
1. Find a package
Use the conda search command to find available packages:
conda search <package>
2. Remove a package
Use the conda remove command to remove a specified package:
conda remove <package>

IV. Exporting and importing environments
1. Export an environment
Use the conda env export command to export the current environment's configuration to a file:
conda env export > <env_file>.yml
2. Import an environment
Use the conda env create command to create a new environment from an environment file:
conda env create -f <env_file>.yml

V. Other common commands
1. List installed packages: use the conda list command to see all packages installed in the current environment.
2. Show information: use the conda info command to see details about the conda installation and its configuration.
3. Clean caches: use the conda clean command to clear conda's caches and free disk space.
4. Add and manage channels.

Meraki MR66 Outdoor Cloud-Managed Wireless LAN Datasheet

The Meraki MR66 is an enterprise-class, dual-concurrent 802.11n cloud-managed access point designed for high-density deployments in harsh outdoor locations and industrial indoor environments. The MR66 features dual-concurrent, dual-band operation and advanced 802.11n technologies such as MIMO and beamforming, delivering the high capacity, throughput and reliable coverage required by the most demanding business applications, even in harsh environments.

MR66 and Meraki Cloud Management: A Powerful Combination
The MR66 is managed via the Meraki cloud, with an intuitive browser-based interface that lets you get up and running quickly without training or certifications. Since the MR66 is self-configuring and managed over the web, it can even be deployed at a remote location without on-site IT staff.
The MR66 is monitored 24x7 via the cloud, which delivers real-time alerts if your network encounters problems. Remote diagnostics tools also enable real-time troubleshooting over the web.
The MR66's firmware is always kept up to date from the cloud. New features, bug fixes, and enhancements are delivered seamlessly over the web, so you never have to manually download software updates or worry about missing security patches.

Product Highlights
- Ideal for outdoor and industrial indoor environments
- Dual-concurrent 802.11n radios with up to 600 Mbps throughput
- Point-to-point links with optional panel antennas
- High performance multi-radio mesh routing
- Layer 7 application fingerprinting and QoS
- Built-in enterprise security, guest access, and NAC
- Self-configuring, plug-and-play deployment
- Automatic cloud-based RF optimization with spectrum analysis
- Real-time WIPS with Air Marshal

Recommended Use Cases
Outdoor coverage for high client density corporate campuses, educational institutions, and parks:
- Provide high-speed access to a large number of clients
- Point-to-multi-point mesh
Indoor coverage for industrial areas (e.g., warehouses, manufacturing facilities):
- Reliable coverage for scanner guns, security cameras, and POS devices
- High-speed access for iPads, tablets and laptops
Zero-touch point-to-point:
- Build a long-distance bridge between two networks
- Extend hotspot networks via mesh while simultaneously serving clients

Features
Dual enterprise-class 802.11n radios, up to 600 Mbps
The MR66 features two powerful radios and advanced RF design for enhanced receive sensitivity. Combined with 802.11n technologies including MIMO and beamforming, the MR66 delivers up to 600 Mbps throughput and up to 50% increased capacity compared to typical rugged enterprise-class 802.11g access points, meaning fewer access points are required for a given deployment. In addition, dual-concurrent 802.11n radios and band steering technology allow the MR66 to automatically serve legacy 802.11b/g clients using the 2.4 GHz radio and newer 802.11n clients using the 5 GHz radio, thus providing maximum speed to all clients.

Rugged industrial design
The MR66 is designed and tested for salt spray, vibration, extreme thermal conditions, shock and dust, and is IP67-rated, making it ideal for extreme environments. Despite its rugged design, the MR66 has a low profile and is easy to deploy.

Application-aware traffic shaping
The MR66 includes an integrated layer 7 packet inspection, classification, and control engine, enabling you to set QoS policies based on traffic type. Prioritize your mission-critical applications, while setting limits on recreational traffic, e.g. peer-to-peer and video streaming.

Automatic cloud-based RF optimization with spectrum analysis
The MR66's sophisticated, automated RF optimization means that there is no need for the dedicated hardware or RF expertise typically required to tune a wireless network. An integrated spectrum analyzer monitors the airspace for neighboring WiFi devices as well as non-802.11 interference (microwave ovens, Bluetooth headsets, etc.). The Meraki cloud then automatically optimizes the MR66's channel selection, transmit power, and client connection settings, providing optimal performance even under challenging RF conditions.

Integrated enterprise security and guest access
The MR66 features integrated, easy-to-configure security technologies to provide secure connectivity for employees and guests alike. Advanced security features such as AES hardware-based encryption and WPA2-Enterprise authentication with 802.1X and Active Directory integration provide wire-like security with the convenience of wireless mobility. One-click guest isolation provides secure, Internet-only access for visitors. Our policy firewall (Identity Policy Manager) enables group- or device-based, granular access policy control. PCI compliance reports check network settings against PCI requirements to simplify secure retail deployments.

Secure wireless environments using Air Marshal
Meraki wireless comes equipped with Air Marshal, a built-in wireless intrusion prevention system (WIPS) for threat detection and attack remediation. APs will scan their environment opportunistically or in real time based on intuitive user-defined preferences. Alarms and auto-containment of malicious rogue APs are configured via flexible remediation policies, ensuring optimal security and performance in even the most challenging wireless environments.

High performance mesh
The MR66's advanced mesh technologies, like multi-channel routing protocols and multiple gateway support, enable scalable, high-throughput coverage of hard-to-wire areas with zero configuration. Mesh also improves network reliability: in the event of a switch or cable failure, the MR66 will automatically revert to mesh mode, providing continued gateway connectivity to clients.

Self-configuring, self-optimizing, self-healing
When plugged in, the MR66 automatically connects to the Meraki cloud, downloads its configuration, and joins your network. It self-optimizes, determining the ideal channel, transmit power, and client connection parameters. It also self-heals, responding automatically to switch failures and other errors.

Low profile, environmentally friendly design
In addition to eliminating excess packaging and documentation, 90% of the access point materials are recyclable. A maximum power draw of only 10.5 watts and a cloud-managed architecture mean that pollution, material utilization and your electric bill are kept to a minimum.

Specifications
Radio
- One 802.11b/g/n and one 802.11a/n radio
- Dual concurrent operation in 2.4 and 5 GHz bands
- Max throughput rate 600 Mbit/s
- 2.4 GHz: 26 dBm peak transmission power
- 5 GHz: 24 dBm peak transmission power
- Max transmission power is decreased for certain geographies to comply with local regulatory requirements
Operating bands:
- FCC (US): 2.412-2.484 GHz; 5.150-5.250 GHz (UNII-1); 5.725-5.825 GHz (UNII-3)
- EU (Europe): 2.412-2.484 GHz; 5.470-5.600 and 5.660-5.725 GHz (UNII-2)
802.11n Capabilities
- 2 x 2 multiple input, multiple output (MIMO) with two spatial streams
- Maximal ratio combining (MRC)
- Beamforming
- Packet aggregation
- Cyclic shift diversity (CSD) support
Power
- Power over Ethernet: 24 - 57 V (802.3af compatible)
- Power consumption: 10.5 W max
- Power over Ethernet injector sold separately
Mounting
- Mounts to walls and horizontal and vertical poles
- Mounting hardware included
Physical Security
- Security screw included
Environment
- Operating temperature: -4°F to 122°F (-20°C to 50°C)
- IP67 environmental rating
Physical Dimensions
- 10.5" x 7.6" x 2.2" (267 mm x 192 mm x 57 mm)
- Weight: 1.9 lb (862 g)
Interfaces
- 1x 100/1000 Base-T Ethernet (RJ45) with 48 V DC 802.3af PoE
- Four external N-type antenna connectors
Security
- Integrated policy firewall (Identity Policy Manager)
- Mobile device policies
- Air Marshal: real-time WIPS (wireless intrusion prevention system) with alarms
- Rogue AP containment
- Guest isolation
- Teleworker VPN with IPsec
- PCI compliance reporting
- WEP, WPA, WPA2-PSK, WPA2-Enterprise with 802.1X
- TKIP and AES encryption
- VLAN tagging (802.1q)
Quality of Service
- Wireless Quality of Service (WMM/802.11e)
- DSCP (802.1p)
- Layer 7 application traffic shaping and firewall
Mobility
- PMK and OKC credential support for fast Layer 2 roaming
- L3 roaming
LED Indicators
- 4 signal strength
- 1 Ethernet connectivity
- 1 power/booting/firmware upgrade status
Regulatory
- FCC (US), IC (Canada), CE (Europe), C-Tick (Australia/New Zealand), Cofetel (Mexico), TK (Turkey)
- RoHS
Mean Time Between Failure (MTBF)
- 450,000 hours
Warranty
- 1 year hardware warranty with advanced replacement included
Ordering Information
- MR66-HW: Meraki MR66 Cloud-Managed Dual-Radio 802.11n Ruggedized Access Point
- POE-INJ-3-XX: Meraki 802.3af Power over Ethernet Injector (XX = US, EU, UK or AU)
- ANT-10: Meraki 5/7 dBi Omni Antenna, Dual-band, N-type, Set of 2
- ANT-11: Meraki 14 dBi Sector Antenna, 5 GHz MIMO, N-type
- ANT-13: Meraki 11 dBi Sector Antenna, 2.4 GHz MIMO, N-type
Note: Meraki Enterprise license required.

kdtree Usage: A Step-by-Step Walkthrough

Introduction: a k-d tree is a data structure for efficiently searching for nearest neighbors in k-dimensional space. Its applications span many fields, such as pattern recognition, computer graphics, and machine learning. This article walks through k-d tree usage step by step, covering construction, insertion, nearest-neighbor search, and deletion.

Step 1: Building the kdtree
Building the tree is the first step in using it. First, we define a node class to represent each node of the tree. Each node has the attributes: point (the point in k-dimensional space stored at the node), left (a pointer to the left subtree), right (a pointer to the right subtree), and axis (the coordinate axis used to split the region).

Next, we define a recursive function to build the tree. It takes a point set and the current node as parameters. The basic idea: find the median point of the current point set along the current axis, make it the current node's point, split the point set into two subsets by that median, and recursively build the left and right subtrees.

The construction procedure is as follows:
1. If the point set is empty, the current node is empty; return.
2. Find the median point m along the current axis and make it the current node's point.
3. Split the point set by m into two subsets: the points less than m and the points greater than m.
4. Recursively build the left subtree from the points less than m, and set its root as the current node's left.
5. Recursively build the right subtree from the points greater than m, and set its root as the current node's right.
After construction finishes, we have a complete k-d tree.

Step 2: Inserting points into the kdtree
On top of a built tree, we can insert new points. Insertion resembles construction. Starting at the current node, we compare the insertion point with the current node's point along that node's axis. If the insertion point is smaller, we recurse into the left subtree; if it is greater or equal, we recurse into the right subtree. If the current node's left or right child is empty, we create a new node and attach it as that child.

The insertion procedure is as follows (a runnable sketch of both procedures appears below):
1. If the current node is empty, create a new node with the insertion point as its point, and return.
2. If the insertion point is less than the current node's point along the current axis, recurse into the left subtree; otherwise recurse into the right subtree.
3. When the target child is empty, attach a new node containing the insertion point as that child.
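To make the two procedures above concrete, here is a minimal Python sketch of the node class, the recursive build, and insertion. The class and function names (Node, build_kdtree, insert) are illustrative, not from the original article.

class Node:
    def __init__(self, point, axis):
        self.point = point   # the k-dimensional point stored at this node
        self.left = None     # subtree of points smaller on `axis`
        self.right = None    # subtree of points greater or equal on `axis`
        self.axis = axis     # splitting axis at this node

def build_kdtree(points, depth=0):
    # Recursively build the tree by splitting at the median on each axis.
    if not points:
        return None
    axis = depth % len(points[0])          # cycle through the k axes
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                 # median along the current axis
    node = Node(points[mid], axis)
    node.left = build_kdtree(points[:mid], depth + 1)
    node.right = build_kdtree(points[mid + 1:], depth + 1)
    return node

def insert(node, point, depth=0):
    # Walk down, going left for smaller coordinates and right otherwise,
    # and attach a new node where an empty child is reached.
    if node is None:
        return Node(point, depth % len(point))
    if point[node.axis] < node.point[node.axis]:
        node.left = insert(node.left, point, depth + 1)
    else:
        node.right = insert(node.right, point, depth + 1)
    return node

# Example: build a 2-d tree and insert a new point
tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
tree = insert(tree, (3, 5))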

Multi-Environment Configuration in React Projects

Multi-environment configuration in a React project solves the problem of configuration differences when deploying and running the project in different environments. When developing a React project, we usually test and deploy in several environments, such as local development, testing, and production. Each environment may need a different API address, database connection information, logging level, and so on. To configure these differing settings easily at deployment time, and to switch configurations quickly when changing environments, we need multi-environment configuration for the React project.

Step 1: Create environment configuration files
In the root directory of the React project, we can create several environment configuration files to hold the settings each environment needs. For example, we can create the following three files:
- .env.development: configuration for the local development environment
- .env.test: configuration for the test environment
- .env.production: configuration for the production environment
These file names use .env as a prefix followed by the environment name. This naming convention is supported by React projects by default, and during the build the configuration file matching the current environment is loaded automatically.

Step 2: Configure environment variables
After creating the environment configuration files, we define each environment's settings in them as key-value pairs. Taking the .env.development file as an example, we can define:

REACT_APP_API_URL=
REACT_APP_LOG_LEVEL=debug

Here REACT_APP_API_URL is a custom setting used to configure the API address, and REACT_APP_LOG_LEVEL is another setting used to configure the logging level. We can give these settings different values in the different environment files to suit each environment's needs.

Note that only settings whose names start with REACT_APP_ are recognized by the React project as environment variables, because Create React App injects all environment variables beginning with REACT_APP_ into the application code by default.

Parallel Correlation Algorithms for Arrival-Time-Difference Computation: Experiments and Performance Analysis

Abstract: To address the massive data-processing problem encountered in seismic-wave velocity tomography, this paper proposes a distributed implementation of the arrival-time-difference correlation computation, presents a program design for the computation under the MapReduce framework, and tests it in a Hadoop environment. The test results show that using MapReduce as a processing framework for massive sensor data is feasible. For the parallel arrival-time-difference correlation computation, the running time on a Hadoop cluster is affected by the volume of data to be processed and by the number of data nodes: the larger the data volume, or the fewer the data nodes, the longer the computation takes, but neither relationship is linear. The average Map time is independent of the data volume and of the number of data nodes; it depends only on what the Map function executes.

Keywords: arrival-time difference; distributed mine-seismic monitoring; MapReduce framework; Hadoop cluster; computation time
CLC number: TP391; Document code: A; Article ID: 2095-1302(2015)02-00-04

0 Introduction
For underground coal-mine seismic exploration, the scale being probed is much smaller than in general seismic exploration, so to detect small-scale geological structures the sensors must be laid out more densely [1]. As sensor density increases, the amount of data collected by the seismic exploration system grows; when processed on a single machine, computing the arrival-time differences and the subsequent inversion takes correspondingly longer [2], which is very unfavorable for the system's real-time behavior. In response to the surge in data to be processed, this paper introduces a MapReduce-based parallel computing system into the data-processing pipeline and, using real vibration data, tests and analyzes how the time the parallel system takes to compute arrival-time differences relates to the volume of data to be processed and the number of cluster nodes.

The main contributions of this paper are:
(1) A design for a parallel implementation of the correlation algorithm used in arrival-time-difference computation.
(2) Tests and analysis of the performance of the parallel correlation algorithm and its influencing factors, together with ideas for further improvement.

1 Background and Problem Statement
1.1 The principle of seismic-wave velocity tomography in coal mines
The principle of seismic-wave velocity tomography is shown in Figure 1. When the medium is homogeneous, the seismic wave can be assumed to travel in straight lines; in that case the average velocity of the medium can be computed by measuring the differences in the wave's arrival times at the sensors [3]. When the medium is inhomogeneous, the propagation path is assumed to bend at the interfaces between different media according to Snell's law. Supposing the cells in Figure 1 have velocities v1…vn, velocity tomography amounts to finding the combination of v1…vn that minimizes the error between the theoretical arrival times computed by ray tracing and the measured arrival times [4].
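The paper computes arrival-time differences with a correlation method. As a minimal illustration of that core step outside MapReduce, the following Python sketch estimates the time lag between two sensors' waveforms from the peak of their cross-correlation; the function name and the use of NumPy are assumptions, not from the paper.

import numpy as np

def arrival_time_difference(sig_a, sig_b, sampling_rate):
    # Cross-correlate the two equally sampled waveforms; the lag at the
    # correlation peak estimates how much later sig_b arrives than sig_a.
    corr = np.correlate(sig_b, sig_a, mode="full")
    lag = np.argmax(corr) - (len(sig_a) - 1)   # index 0 is lag -(len-1)
    return lag / sampling_rate                  # positive: sig_b is later

# Example with synthetic data: b is a copy of a delayed by 50 samples
rng = np.random.default_rng(0)
a = rng.standard_normal(1000)
b = np.roll(a, 50)
print(arrival_time_difference(a, b, 1000.0))    # approximately 0.05 s

In a MapReduce implementation along the lines the paper describes, a mapper could emit one record per sensor pair and the reducer could apply a function like this to each pair's waveforms.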

ABB Ability Asset Manager Product Introduction

ABB Ability™ Asset Manager: simplicity and flexibility in asset performance management (Distribution Solutions). Document ID 9AKK108468A3980.

Contents: Why Asset Management? · Asset Health Assessment · ABB Ability™ Asset Manager · Supported devices and solutions · ABB Ability™ Edge Industrial Gateway and cloud connectivity guide · Cyber security · ABB Ability Marketplace™

Why Asset Management?
Real-time operational data from equipment delivers predictive analytics and maintenance advice that help customers make timely decisions.
- Make better, data-driven decisions
- Maximize up-time: reduce the risk of unplanned outages, which directly affect revenue generation
- Reduce total cost of ownership: optimize the maintenance schedule and increase workforce efficiency
- Increase safety and sustainability: engage field service engineers only when needed, optimizing resources and extending asset life while maintaining safety

Business value of better performance (source: "ABB Data center case study, document 9AKK107991A1983 REV A"):
- Up to 40% savings on maintenance cost
- Up to 100% avoidance of unplanned labor
- Up to 30% increase of the time interval between maintenance

Asset Health Assessment
Monitoring solutions for identifying evolving faults:
1. Main circuit temperature monitoring: continuous temperature monitoring of key hot spots; factory-integrated design, production, testing
2. Insulation gas density monitoring: gas pressure, temperature compensation, alarm contact; factory-integrated design, production, testing
3. Partial discharge monitoring: UHF or capacitive coupling methods; sensors isolated from the primary circuit
4. Video monitoring: visible monitoring of chassis, ES, 3PS movement; observation window and digital camera integrated in the panel
5. Mechanical characteristics monitoring: nonintrusive Hall sensor for coil/charging motor current; motion sensor for CB contact travel
6. Ambient temperature and humidity monitoring of the site

Failure modes addressed:
- 34% loose connections/joints
- 20% environment conditions (humidity, dust, etc.)
- 17% incorrect work (human mistakes)
- 10% faulty insulation (dielectric problems)
- 9% faulty equipment (mechanical)
- 7% other
- 3% overload

Asset health analysis:
- Mechanical: model-based VI remaining-life prediction (VI contact degradation analysis, load/short-circuit breaking current, VI breaking characteristic diagram, characteristic curve); self-learning analytics of the CB model-specific mechanism (CB mechanical performance analysis: contact travel, open/close coil current, spring charging motor current; contact displacement and coil current waveform; CB residual life)
- Thermal: adaptive temperature analysis with a dynamic algorithm based on loading (connection degradation analysis: connection temperature, current flow, ambient temperature, 3-phase temperature balance)
- Dielectric, gas insulation: predictive gas insulation analysis (gas refilling date prediction, temperature-compensated gas density, adaptive analytics for slow and fast leakage detection)
- Dielectric, air/solid insulation: partial discharge analysis (insulation degradation analysis: PD presence indication, PD strength (dB), PD intensity after noise removal, PD pulse rate (Hz), number of PD events above the noise level over time, phase or panel fault indication)

Health prediction is based on monitoring key indicators and aging phenomena. Based on status prediction, proactive operation and maintenance management supports planning repair in advance, reducing the risks of failure and losses significantly: over the asset aging cycle, planned repair replaces emergency repair.

The offering combines hardware (sensors and IoT focused on thermal, electrical, mechanical and environmental aging phenomena), software as a service (know-how and algorithms: aging phenomena + big data + know-how = advanced algorithm; a health index and expert maintenance advice based on health-status analysis), and service contracts (remote support from ABB product experts; on-site support by skilled and certified technicians).

What is ABB Ability™ Asset Manager?
It is an asset management tool for monitoring the condition and performance of the electrical system from low voltage to medium voltage. It is a cloud-based, plug-and-play platform that gathers data from smart assets, monitoring devices and sensors. It is a cost-efficient solution with add-ons and premium services. It features a flexible user interface, providing users with pre-configured or customized dashboards and views, and it provides maintenance advice based on sophisticated data analysis.
Key features: asset management, monitoring and diagnostics, data analysis, maintenance advice and planning, events and notifications, reporting.

Dashboard
Users can create their own dashboards with several widgets showing KPIs and trend data: up to 15 different dashboards with up to 25 widgets each. An example asset health dashboard includes: asset health overview, asset list, graphical site exploration, monitoring device connectivity overview, local time, service activities overview, service activities, events overview, latest events overview, and a site explorer.

Explore
Users can browse assets by: a hierarchical view; a flat list, searchable by different criteria; visually, with pictures, diagrams, etc.; or a connectivity list based on the system connection (gateway, data concentrators, etc.).

Asset overview
The asset overview provides critical health indicators at a glance and is reached from dashboard widgets, the hierarchical view, or asset lists. Users can easily identify areas that require attention; for example, an orange dot together with the health details can indicate an issue with abnormal temperature at a cable connection.

Asset data analysis
In the Data Analysis tab the user can navigate to pre-defined charts that support failure-cause analysis. Typical pre-defined charts cover thermal analysis (e.g., joint temperatures), mechanical analysis (e.g., breaker operations), and electrical analysis (e.g., current, voltage). For example, a warning about cable joints can be confirmed by an abnormal phase-A temperature trend, and a corresponding maintenance advice is provided.

Asset partial discharge analysis
Asset Manager provides valuable insights into the dielectric condition of the switchgear. Depending on the PD measurement equipment employed, trends support user analysis with a partial discharge presence trend and PD strength variation and increase trends.

Predict add-on
The Predict add-on uses mechanical data (operation count, contact wear) and electrical parameters (current flowing in steady state and at protection trips: overload, short circuit, earth faults) to produce reliability curves, the last maintenance performed, and the next recommended maintenance. Prerequisites: acquire a Predict add-on license for each monitored breaker; set the environmental conditions (temperature, humidity, dust level, vibration); add the installation date and the supply voltage value; use Ekip Connect to record maintenance activities performed by a certified technician.

Maintenance
Once a problem has been identified, the maintenance planner is used to create and schedule maintenance tasks. The assigned users are automatically notified of new or upcoming activities.

ABB Ability™ Edge Industrial Gateway
Before your data reaches the cloud, it passes through the ABB Ability™ Edge Industrial Gateway, a secure doorway and an extra layer of security. It manages how your data is collected and shared, safeguards operational technology from external access, prevents devices from connecting to the cloud directly, and streamlines protection to give you peace of mind.
Features:
- Easy installation and commissioning: the gateway can be DIN-rail mounted; commissioning and firmware updates are done with the ABB Provisioning Tool (requires internet connectivity)
- Versatile connectivity: from low to medium voltage and vice versa; from monitoring a single switchgear panel to supervising multiple substations and sites
- Flexible: provides enhanced connectivity for Wi-Fi or 3G/4G mobile network services; can manage up to 15 RTU devices and 45 TCP devices (a list of supported devices, the manual, and how-to videos are available on the product web page)

Solution architecture
Switchgear sensors (ambient temperature and humidity, connection temperature, gas tank pressure, mechanical characteristics, video capture, partial discharge, connected over Zigbee) feed data concentrators, which connect through the ABB Ability™ Edge Industrial Gateway to the ABB Ability™ Asset Manager cloud, accessible from computer and mobile browsers.

Supported M&D solutions
Complete rules and connected devices are available separately. Supported solutions include: SWICOM, the switchgear monitoring and diagnostic unit for M&D retrofit; REF615/620 MV relays; UniSec Digital, air-insulated switchgear for secondary distribution; UniGear Digital, air-insulated switchgear for primary distribution; SafeRing/SafePlus Digital, gas-insulated switchgear for secondary distribution; PrimeGear and the ZX family Digital, gas-insulated switchgear for primary distribution; the Emax 2 air circuit breaker and the Ekip UP LV digital unit. SCADA integration is also supported.

MV air-insulated digital switchgear
- UniGear Digital 2.0: monitoring and diagnostics: temperature monitoring; environment monitoring (temperature and humidity) in the cable compartment and switch room; partial discharge monitoring (PDCOM); CB monitoring (VD4 with Relion or VD4evo); panel mechanical interlocks; ABB Ability™ Asset Manager. Smart automation and control: Relion protection relay with IEC 61850/GOOSE; current measurement with current sensors; voltage measurement with voltage sensors.
- UniSec: monitoring and diagnostics: temperature monitoring; environment (temperature and humidity); partial discharge monitoring; gas sensor for GSec/HySec; CB mechanism monitoring; ABB Ability™ Asset Manager. Smart automation and control: Relion protection relay with IEC 61850/GOOSE; current sensors; voltage sensors.

MV gas-insulated digital switchgear
- ZX Digital: monitoring and diagnostics currently available: remote gas density monitoring; cable connection temperature; VI electrical life estimation; CB mechanism (basic: Hall sensors); environment temperature and humidity (LVC); environment (switchgear building); CB mechanism (advanced: angle sensors); 3PS monitoring (electrical); PD monitoring; ABB Ability™ Asset Manager. Smart automation and control: switchgear controller and digital communication with IEC 61850 and GOOSE/SMV; current sensors; voltage sensors (in socket) or voltage sensors (on cable plug).
- SafePlus/SafeRing: monitoring and diagnostics: remote gas density monitoring; cable connection temperature; environment (temperature and humidity); three-position switch monitoring; partial discharge monitoring; ABB Ability™ Asset Manager. Smart automation and control: switchgear controller and digital communication with IEC 61850 and GOOSE/SMV; current sensors; voltage sensors (in socket) or voltage sensors (on cable plug); combined bushing sensors (KEVCY) for current and voltage measurements.

LV digital switchgear
Several devices of a low-voltage switchgear, like MNS and NeoGear, can be connected via the Edge Industrial Gateway to ABB Ability™ Asset Manager: Emax 2 (ACB) and Ekip UP.

SWICOM: ABB Ability™ switchgear condition monitoring
Optimized for monitoring and diagnostic retrofit installations:
1. Breaker monitoring through Relion relays, utilizing the existing substation automation network infrastructure
2. Partial discharge detection using the ABB PDCOM sensor
3. Wireless temperature monitoring, suitable for IEC and ANSI, AIS and GIS (cables), ABB and non-ABB, green- and brown-field switchgear
4. Data visualization: SCADA (IEC 61850), the ABB Ability™ Asset Manager cloud, and a local HMI/mobile app
The system configuration is indicative; the actual configuration depends on the configured feature set.

Cyber security
Together with Microsoft, an expert partner in IT security, ABB Ability™ Asset Manager ensures state-of-the-art cybersecurity and secures your data. It is: (1) designed according to an analysis of potential threats; (2) developed according to stringent security guidelines and recurring code assessments; (3) validated by penetration tests that verify robustness; (4) built on Microsoft Azure cloud platform security technology and policies; (5) continuously improved to always ensure the latest standards and innovations. See the external cybersecurity page on the ABB portal.

ABB Ability™ Marketplace
The ABB Ability Marketplace™ is a unified subscriber portal where customers can discover, subscribe, manage, and scale across ABB's ecosystem of SaaS services. Membership grants access to the complete portfolio of subscription and perpetual software services and connects you directly to industry-specialized teams, who will help you identify the best solutions and contracts for your company.

Software product types:
- Subscriptions: recurring subscriptions that are auto-renewed when purchased from the Marketplace. While purchasing an Asset Manager subscription, a Site ID needs to be entered; the subscription is deployed automatically after the purchase. Example: Asset Manager or Energy Manager subscription (AM or EM).
- Vouchers: one-off purchases that activate the services for a limited time only, usually 1 year. After expiration, the plant needs to be reactivated either with a subscription or with another voucher. After purchasing a voucher, the user receives an activation code by email; the code needs to be redeemed within the portal. Example: Asset Manager or Energy Manager (AM or EM).
- Add-ons: base subscriptions or vouchers can be enhanced with additional functionalities called add-ons. If add-ons are purchased along with subscriptions, they are recurring in nature; if purchased along with vouchers, they are one-off purchases. Examples: Extra Users, LV CB Health.

Documentation and contacts: see the external ABB Ability™ Asset Manager page for customers. Get in touch.

Data Skew? Spark 3.0 AQE Cures All Stubborn Cases

Spark 3.0 has been released for half a year now. This major version upgrade concentrates on performance optimization and richer documentation, with 46% of the improvements focused on Spark SQL; among the SQL optimizations, the most eye-catching is Adaptive Query Execution (a configuration sketch for enabling it appears at the end of this post).

Adaptive Query Execution (AQE) is an adaptive execution engine, improved and implemented on top of the community version of Spark by Intel's big data technology team and engineers from Baidu's big data infrastructure department. In recent years, Spark SQL has continuously optimized its CBO (cost-based optimization) capabilities, and very successfully so.

Basics of CBO
First, let us introduce the other kind of optimizer, based on rule-based optimization (RBO). RBO is an empirical, heuristic approach: the optimization rules are all predefined, and a SQL query only needs to be matched against them. Put simply, RBO is like an experienced old driver who knows all the standard routes. However, there is such a thing in the world as not following the standard routes; or rather, some queries simply have no standard route at all. The most typical case is optimizing complex join operators, for which there are usually two questions to answer:
1. Which algorithm strategy should a join use: broadcast join, shuffle hash join, or sort-merge join? Different execution strategies place very different demands on system resources, and their execution efficiency differs enormously. The same SQL statement may take only a few seconds with the right strategy, while failing to pick a suitable strategy may drive the system to OOM.
2. For a snowflake or star schema, in what order should a multi-table join execute? Different join orders mean different execution efficiency. For example, in A join B join C, where tables A and B are both large and table C is small, A join B clearly requires substantial system resources and its execution time will not be short. If instead the order A join C join B is used, then because C is small, A join C finishes quickly and produces a small result set, and joining that small result set with B clearly performs better than the previous plan.
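As a minimal illustration of turning AQE on in Spark 3.0, the snippet below uses standard Spark 3.0 configuration keys; the session name is arbitrary, and AQE is off by default in this release.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("aqe-demo")
    # Enable Adaptive Query Execution (disabled by default in Spark 3.0).
    .config("spark.sql.adaptive.enabled", "true")
    # Let AQE coalesce small shuffle partitions at runtime.
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    # Let AQE split skewed partitions in sort-merge joins.
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    .getOrCreate()
)

With these settings, Spark re-optimizes the physical plan at shuffle boundaries using runtime statistics, which is how AQE addresses the join-strategy and data-skew problems described above.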

Solving Task Scheduling with the Ant Colony Algorithm - Python

The ant colony algorithm is a heuristic optimization algorithm, a form of swarm intelligence and evolutionary computation. Compared with genetic algorithms and particle swarm optimization, what the ant colony algorithm optimizes is the pheromone concentration on a topological order (or path), whereas genetic and particle swarm algorithms optimize a single individual (a solution vector).

Take the TSP as an example. Among 30 cities there are 900 pairwise correspondences, i.e., 30×29/2 = 435 undirected paths. After ants traverse them, each path carries deposited pheromone. A topological order refers to one solution vector: the traveling salesman's tour is a closed loop of 30 paths joined end to end. When walking, an ant considers the pheromone concentration on the paths it can take and chooses one topological order as a new solution.

Take the job-shop/task scheduling problem as another example. Suppose there are 8 machines and 10 tasks, giving 80 task-machine correspondences in total. Each correspondence is treated as a path carrying a pheromone mark, and a solution vector has length 10. When walking, an ant considers, for each task, the pheromone concentration on its correspondences to all machines and picks one, thereby choosing which machine each of the 10 tasks is placed on, which yields a new solution.

The ant colony algorithm is naturally suited to path and assignment problems and usually performs better on them than particle swarm methods. Compared with simulated annealing, its computation is more stable; compared with particle swarm optimization, it converges better.

For the task scheduling problem, we need to define the model objects: nodes and cloud tasks. A VM node has a node id, the total amounts of two resources, and the already occupied amount of each; its remaining space is therefore the total capacity minus the occupied supply.

class Cloudlet:
    def __init__(self, cpu_demand: float, mem_demand: float):
        self.cpu_demand = cpu_demand
        self.mem_demand = mem_demand

class VM:
    def __init__(self, vm_id: int, cpu_supply: float, cpu_velocity: float,
                 mem_supply: float, mem_capacity: float):
        self.id = vm_id
        self.cpu_supply = cpu_supply
        self.cpu_velocity = cpu_velocity
        self.mem_supply = mem_supply
        self.mem_capacity = mem_capacity

We then define the class that solves the task scheduling problem with the ant colony algorithm.
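As a minimal sketch of the core step such a class performs, the code below chooses a machine for each task with probability proportional to pheromone concentration (roulette-wheel selection). The pheromone matrix, the alpha exponent, and the function names are illustrative assumptions, not from the article.

import random

def choose_machine(task_idx, pheromone, alpha=1.0):
    # Pick a machine for one task, with probability proportional to
    # pheromone[task_idx][m] ** alpha.
    weights = [p ** alpha for p in pheromone[task_idx]]
    total = sum(weights)
    r = random.uniform(0, total)
    acc = 0.0
    for m, w in enumerate(weights):
        acc += w
        if r <= acc:
            return m
    return len(weights) - 1

def build_solution(n_tasks, pheromone):
    # One ant builds a full solution: a machine index for each task.
    return [choose_machine(t, pheromone) for t in range(n_tasks)]

# Example: 10 tasks, 8 machines, uniform initial pheromone
pheromone = [[1.0] * 8 for _ in range(10)]
solution = build_solution(10, pheromone)

A full scheduler would repeat this for many ants, evaluate each solution against the VM capacities defined above, and reinforce the pheromone of the best solutions while evaporating the rest.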

Ran Out of Retrieving Query Results

"Ran out of retrieving query results" refers to a situation where a computer program or system fails to retrieve or access the desired information or data. This problem can occur for various reasons, such as technical glitches, database errors, inadequate resources, or programming issues. In this article, we will explore the reasons behind running out of retrieving query results and the steps to resolve this issue effectively.

Introduction
In the modern era, where the volume of data is growing exponentially, the ability to retrieve and analyze information is crucial for businesses and individuals. However, there are instances when our systems fail to retrieve the desired query results, causing frustration and hindering productivity. Understanding the root causes behind this issue can help us take appropriate steps to resolve it.

Reasons behind running out of retrieving query results
1. Insufficient memory or storage: One common reason is a lack of memory or storage capacity. When a system's memory is full or near its limit, it fails to process and retrieve queries effectively, leading to no results or partial results.
2. Network issues: A slow or unreliable network connection can also result in running out of retrieving query results. When the network is weak, the system takes longer to fetch data, making the retrieval process inefficient.
3. Database overload: If the database that stores the query results becomes overloaded with data or lacks proper indexing, it might struggle to retrieve the requested information. This can lead to incomplete or no results being returned to the user.
4. Inadequate system resources: Insufficient processing power, CPU headroom, or disk input/output (I/O) capacity can significantly impact the retrieval process. When the system's resources are limited or poorly allocated, it may struggle to fetch and display query results within a reasonable timeframe.

Steps to resolve the issue
1. Check available memory and storage: Begin by analyzing the memory and storage capacity of the system. Ensure that sufficient memory is available for query processing and that there is enough storage to store and retrieve query results. If memory or storage is low, consider upgrading or optimizing these resources.
2. Optimize network connectivity: Evaluate the network connectivity and address any issues affecting its speed and reliability. Check for any bandwidth limitations, network congestion, or hardware problems that may hinder the retrieval of query results. Implement measures to improve network performance, such as upgrading hardware, modifying network configurations, or using caching mechanisms.
3. Optimize the database: Analyze the database infrastructure and optimize it for query retrieval. Ensure proper indexing is implemented, as it can significantly improve the speed and efficiency of queries. Consider partitioning the database, archiving old data, or utilizing database sharding techniques to distribute the load and enhance retrieval performance.
4. Allocate sufficient system resources: Monitor system resource utilization, including CPU usage, disk I/O, and network bandwidth. If any resource consistently approaches its limit during query retrieval, consider allocating additional resources or optimizing existing resource allocation strategies. This may involve upgrading hardware, optimizing software configurations, or utilizing load balancers or parallel processing techniques.
5. Review query design and execution: Examine the queries being sent to the system and their execution plans. Poorly designed or inefficient queries can consume excessive resources and result in running out of retrieving query results. Review and optimize query structures, joins, and filtering conditions to make them more efficient. Consider seeking expert help from database administrators or query optimization tools to improve query performance.
6. Implement caching and data retrieval strategies: Utilize caching mechanisms to store frequently accessed query results. Caching can significantly enhance retrieval speed and reduce the strain on system resources. Consider implementing server-side or client-side caching techniques, depending on the nature of the application and its requirements.

Conclusion
Running out of retrieving query results can be a frustrating and detrimental issue for effective data analysis and decision-making. By understanding the reasons behind this problem and following the steps mentioned above, individuals and organizations can mitigate this issue and ensure efficient retrieval of query results. Regular monitoring, optimization, and periodic system maintenance are essential to maintain an optimal data retrieval experience.
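As one concrete illustration of the caching strategy from step 6, here is a minimal Python sketch using the standard library and an in-memory SQLite database; the schema and query are made up for the example.

import sqlite3
from functools import lru_cache

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, val TEXT)")
conn.execute("INSERT INTO t VALUES (1, 'a'), (2, 'b')")

@lru_cache(maxsize=1024)
def run_query(sql: str):
    # Repeated identical read-only queries are served from the cache,
    # reducing load on the database during retrieval.
    return tuple(conn.execute(sql).fetchall())

print(run_query("SELECT val FROM t WHERE id = 1"))  # hits the database
print(run_query("SELECT val FROM t WHERE id = 1"))  # served from the cache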

How Joins Work in Hive

Hive join is a method used in Hive, a data warehouse infrastructure that provides data summarization, query, and analysis. A Hive join allows records from two or more tables in a database to be combined, providing a way to merge data from different sources based on a related column between the tables.

One perspective for understanding the join principle in Hive is to consider the different types of joins that can be performed. Hive supports various types of joins, including inner join, left outer join, right outer join, and full outer join. Each of these join types has its own specific way of merging the data from the tables based on the specified conditions.

Spark Programming Basics: Spark SQL Unit Quiz with Answers

I. Single-choice questions

1. Which of the following statements about Shark is correct? ( )
A. Shark provides functionality similar to Pig
B. Shark translates SQL statements into MapReduce jobs
C. Shark reuses Hive's logic for HiveQL parsing, logical-plan translation, and plan optimization
D. Shark's performance is much worse than Hive's
Correct answer: C

2. Which of the following statements about the Spark SQL architecture is wrong? ( )
A. It rewrote the logical-plan optimization part of Shark's original architecture, solving Shark's problems
B. For Hive compatibility, Spark SQL depends on Hive only for HiveQL parsing and Hive metadata
C. In Spark SQL, execution-plan generation and optimization are handled by Catalyst (a functional relational query optimization framework)
D. Spark SQL depends on Hive for execution-plan generation and optimization
Correct answer: D

3. To save a DataFrame to the file people.json, which statement is correct? ( )
A. df.write.json("people.json")
B. df.json("people.json")
C. df.write.format("csv").save("people.json")
D. df.write.csv("people.json")
Correct answer: A

4. Which of the following is not a common DataFrame operation? ( )
A. printSchema()
B. select()
C. filter()
D. sendto()
Correct answer: D

II. Multiple-choice questions

1. Shark's design led to two problems. ( )
A. Execution-plan optimization depends entirely on Hive, making it inconvenient to add new optimization strategies
B. Execution-plan optimization does not depend on Hive, making it convenient to add new optimization strategies
C. Spark uses thread-level parallelism while MapReduce uses process-level parallelism, so Spark's Hive-compatibility implementation had thread-safety problems, forcing Shark to maintain a separate, independently patched branch of the Hive source code
D. Spark uses process-level parallelism while MapReduce uses thread-level parallelism, so Spark's Hive-compatibility implementation had thread-safety problems, forcing Shark to maintain a separate, independently patched branch of the Hive source code
Correct answers: A, C

2. Which of the following correctly describes the reasons for introducing Spark SQL? ( )
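For reference, here is a minimal PySpark snippet exercising the DataFrame operations the quiz mentions; the example data is made up.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("quiz-demo").getOrCreate()
df = spark.createDataFrame([("Alice", 34), ("Bob", 19)], ["name", "age"])

df.printSchema()                 # inspect the schema
df.select("name").show()        # project a column
df.filter(df.age > 20).show()   # filter rows
df.write.json("people.json")    # save as JSON (question 3, answer A)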

The pathjoinsubstitution Function

A step-by-step look at Python's path.join and substitution functions.

Introduction: Python is a widely used programming language with a rich standard library that offers developers convenience and efficiency. In the Python standard library, the path module is one of the important modules for handling file and directory paths, and its path.join and substitution functions are quite useful in file-path handling. This article explains the usage and purpose of these functions in detail.

I. The path.join function and what it does
path.join is a function in the Python standard library for joining multiple path components. It joins the given components and returns a single path string. It can be invoked in two ways: os.path.join(*path), unpacking a tuple of components, and os.path.join(path1, path2, ...), passing multiple components directly. We will walk through both forms step by step.

1. The os.path.join(*path) form
In this form, the path argument is passed as an unpacked tuple; any number of path components can be supplied. The join function connects the components using the system's path separator and returns a valid path string. For example, consider the following code fragment:

import os

path_tuple = ("dir", "subdir", "file.txt")
full_path = os.path.join(*path_tuple)
print(full_path)

Running the code above produces:

dir/subdir/file.txt

In this example, the tuple path_tuple contains three components, and the join function connects them according to the path structure, returning the complete path string.

2. The os.path.join(path1, path2, ...) form
In this form, the join function receives the path components as multiple arguments.
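The same result as the tuple example above can be obtained by passing the components directly:

import os

full_path = os.path.join("dir", "subdir", "file.txt")
print(full_path)  # dir/subdir/file.txt on UNIX-like systems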

How conda Command-Line Arguments Work

When using conda, various operations can be performed through command-line arguments. Conda is a tool for package management and environment management that lets users create, export, install, remove, and update the software packages in an environment. Below are some common conda command-line arguments with brief explanations of how they work:

1. Create an environment:
conda create --name myenv
How it works: the command above creates a new environment named myenv. Conda creates a new environment under the given name and installs the base packages by default.

2. Install a package:
conda install numpy
How it works: this command installs the numpy package and its dependencies. Conda checks the package's dependency relationships, makes sure they satisfy the environment's requirements, and downloads and installs the packages.

3. Update a package:
conda update numpy
How it works: updates the numpy package and its dependencies. Conda checks the versions of the installed packages, then downloads and installs the newer versions.

4. List the installed environments:
conda env list
How it works: lists all installed conda environments, showing each environment's name, path, and which one is currently active.

5. Activate an environment:
conda activate myenv
How it works: activates the environment named myenv. Activating an environment changes the command-line prompt to indicate that you are working inside a specific conda environment.

6. Export an environment configuration:
conda env export > environment.yml
How it works: exports the current environment's configuration to a YAML file, so that the same environment can be rebuilt elsewhere.

These are some basic conda command-line arguments and how they work. Conda uses a central package index, allowing users to manage packages and environments easily. Conda also resolves dependencies, ensures compatibility between packages, and provides isolation between different environments.

PurePath Usage

PurePath is a technology and tool for analyzing and tracing application performance. It provides real-time, end-to-end response-time and transaction tracing, and can be used to optimize application performance, improve user experience, and diagnose and resolve problems. In this article we explore how PurePath is used.

Step 1: Understand the PurePath concept
PurePath is a technology developed by Dynatrace for tracing transactions and performance problems in applications. By capturing and analyzing every interaction step and system call of real user sessions, it provides detailed visualization and analysis, enabling developers and operations teams to understand an application's performance characteristics and bottlenecks in depth.

PurePath not only provides key performance metrics such as response time and throughput; it also traces the path taken by every transaction call, including database queries, external service calls, and internal application method calls. This lets developers identify bottlenecks and potential problems and take corresponding measures to improve application performance.

Step 2: Install and configure PurePath
To use PurePath, you first need to install the Dynatrace monitoring agent. The monitoring agent is the component responsible for capturing and analyzing application performance data. It can be integrated into the application's code, or it can connect over the network to the application's runtime environment.

Installing and configuring the monitoring agent is usually a relatively simple process. In general, you set the agent's connection parameters in the application's runtime environment and make sure the agent can communicate with the Dynatrace monitoring server. Once the setup is complete, the monitoring agent begins capturing and reporting data about application performance.

Step 3: Use PurePath for application performance analysis
Once the monitoring agent is working, PurePath records all of the application's transactions and system calls. These data are sent to the Dynatrace monitoring server for analysis and reporting.

Performance analysis with PurePath typically includes the following steps:
- Navigate to the Dynatrace monitoring server. On the monitoring server, you will find a dashboard for viewing and managing the application's performance data.
- On the dashboard, select the application you want to analyze. You can filter by the application's name, version, or tags.

Join Strategies in Spark (pySpark)

Whether in MapReduce or Spark, performance optimization for distributed frameworks falls roughly into three areas: load balancing, network transfer, and disk I/O. Spark is a memory-based computing framework, so applications should take full advantage of its in-memory computation. This post focuses on join problems in Spark applications; optimization of cluster parameters will be covered in another post.

On both traditional database platforms and distributed computing platforms, the performance cost of joins is considerable. For Spark, if the joined tables are large, network and disk pressure rise sharply during the shuffle; in severe cases executors may fail and the job cannot proceed. The main optimization for such joins is to re-express the join using map and filter, reducing the network and disk I/O of the shuffle stage. The discussion below is split in two parts by table size:

Large table: a table with a large amount of data.
Small table: a table with a small amount of data.

I. Joining a large table with a small table
This is the main join pattern in most business scenarios: distribute the small table to every executor as a broadcast variable, then apply filter/map operations over the large table. The examples below illustrate each join type (and tolerate non-unique IDs in the tables).

1. leftOuterJoin

>>> d1 = sc.parallelize([(1, 2), (2, 3), (2, 4), (3, 4)])
>>> d2 = sc.parallelize([(1, 'a'), (2, 'b'), (1, 'd'), (5, '2')])

Native implementation:

>>> d1.leftOuterJoin(d2).collect()
[(1, (2, 'a')), (1, (2, 'd')), (2, (4, 'b')), (2, (3, 'b')), (3, (4, None))]

Map-based implementation (for the small table on the right; the small-table-on-the-left case is somewhat more involved, requiring a few extra operations, and is rare in practice):

def doJoin(row):
    result = []
    if row[1][1] is not None:
        for i in row[1][1]:
            result += [(row[0], (row[1][0], i))]
    else:
        result += [row]
    return result

d2_map = {}
for i in d2.groupByKey().collect():
    d2_map[i[0]] = i[1]
d2_broadcast = sc.broadcast(d2_map)
d2_dict = d2_broadcast.value
d1.map(lambda row: (row[0], (row[1], d2_dict.get(row[0])))).flatMap(doJoin).collect()
>>> [(1, (2, 'd')), (1, (2, 'a')), (2, (3, 'b')), (2, (4, 'b')), (3, (4, None))]

2. join
Here join means inner join, keeping only the matched items; simply add a filter to the implementation above:

d1.map(lambda row: (row[0], (row[1], d2_dict.get(row[0])))).filter(lambda row: row[1][1] is not None).flatMap(doJoin).collect()
>>> [(1, (2, 'd')), (1, (2, 'a')), (2, (3, 'b')), (2, (4, 'b'))]

II. Joining two large tables (reduce-join)
Joins between large tables cannot be optimized by caching data, so the optimization focus shifts to partitioning efficiency and key design:
1. Use numeric types for the join key whenever possible, reducing partitioning and shuffle time; numeric keys also match faster during the join.
2. Apply filter conditions before the join, so the joined data volume is as small as possible.
3. Before the join, repartition both tables to the same number of partitions.
Reduce-join: partition both tables by key so that rows with the same key land in the same partition, then perform the join with mapPartitions.

Graph Neural Network (GNN) Basics

Contents: Mastering the basics of graph neural networks, this article is all you need · Graphs · Graph neural networks · DeepWalk: the first algorithm for unsupervised learning of node embeddings · GraphSage: learning an embedding for every node · Summary · Using graph neural networks (GNN) to find shortest paths · Introduction · How to understand "graph neural networks" in AI (with a Stanford survey)

DeepMind, Google Brain, MIT and other institutions jointly proposed "graph networks" (GNN), combining end-to-end learning with inductive reasoning, with the promise of solving deep learning's inability to perform relational reasoning.

Mastering the basics of graph neural networks: this article is all you need. [New Intelligence digest] Graph neural networks (GNN) are becoming more and more popular across fields. This article introduces the basics of graph neural networks, along with two more advanced algorithms: DeepWalk and GraphSage.

Recently, graph neural networks (GNN) have become increasingly popular in many areas, including social networks, knowledge graphs, recommender systems, and even the life sciences. GNNs are powerful at modeling the dependencies between nodes in a graph, enabling breakthroughs in research areas related to graph analysis. This article aims to introduce the basics of graph neural networks and two more advanced algorithms: DeepWalk and GraphSage.

Graphs
Before discussing GNNs, let us first look at what a graph is. In computer science, a graph is a data structure made of two components: vertices and edges. A graph G can be described by the set of vertices V and the set of edges E it contains. Edges can be directed or undirected, depending on whether there are directional dependencies between vertices (the referenced wiki figure shows a directed graph). Vertices are also commonly called nodes; in this article the two terms are interchangeable.
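As a small concrete illustration of this definition, a graph G with vertex set V and edge set E can be written directly in Python; the example graph is made up.

# A directed graph G = (V, E) as plain vertex and edge sets
V = {"a", "b", "c", "d"}
E = {("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")}  # (x, y) is an edge x -> y

# Adjacency-list view, often more convenient for traversal
adj = {v: [y for (x, y) in E if x == v] for v in V}
print(adj["a"])  # ['b', 'c'] (order may vary, since E is a set)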

Graph neural networks
A graph neural network is a neural network that operates directly on the graph structure. A typical application of GNNs is node classification: essentially, every node in the graph is associated with a label, and our goal is to predict the labels of the nodes that have no ground truth. This section describes the algorithm from the paper "The graph neural network model" (Scarselli, F., et al., 2009) [1], the first paper to propose GNNs, which is therefore usually considered the original GNN.

The pathjoinsubstitution Function

The path.join function is a method for joining the parts of a file path on an operating system. In this article, we answer step by step questions about path joining and substitution, explain why these concepts matter, and show how to use them in real code to improve quality and readability.

In computer programming, a path is the address pointing to a file or folder. Path formats may differ across operating systems. For example, on Windows, paths use a backslash (\) to separate folders, while on UNIX and UNIX-like systems paths use a forward slash (/).

path.join is a method for joining paths. It can be imported and used as part of a module to join paths in code. The function accepts any number of arguments and returns the joined path. Let us look at a simple example to understand its usage. Suppose we have two variables, a folder path (folder_path) and a file name (file_name):

folder_path = "/home/user/documents"
file_name = "report.txt"

Now we want to join these two variables into a complete file path. We can use the path.join function to achieve this:

import os

full_path = os.path.join(folder_path, file_name)

In this example, we first import the os module, which provides many functions related to the operating system. Then we use the path.join function to connect the folder_path and file_name variables and save the result in the full_path variable. The advantage of this approach is that it automatically selects the appropriate path separator for the operating system: the code works correctly on both Windows and UNIX systems.

Basic conda Commands

Conda is an open-source package management system and environment management system. Below are some commonly used conda commands.

Creating environments:
conda create --name myenv : create an environment named myenv.
conda create -n myenv python=3.8 : create an environment named myenv with Python 3.8.
conda create -n myenv python=3.8 numpy pandas : create the myenv environment and install the numpy and pandas packages.

Activating and leaving environments:
conda activate myenv : activate the environment named myenv.
conda deactivate : leave the current environment.

Installing and removing packages:
conda install numpy : install the numpy package.
conda install numpy pandas : install numpy and pandas together.
conda install -c conda-forge tensorflow : install the tensorflow package from the conda-forge channel.
conda remove numpy : uninstall the numpy package.

Listing installed packages:
conda list : list all packages installed in the current environment.

Exporting and importing environments:
conda env export > environment.yml : export the current environment's configuration to the environment.yml file.
conda env create -f environment.yml : create a new environment from the environment.yml file.

These are only some of conda's basic commands; many other features and options are available as needed. Run conda --help or consult the conda documentation for more details.

PyCharm Keeps Indexing, or the Index Grows So Large It Fills the System Drive

Symptoms: using PyCharm for deep learning, the IDE gets slower and slower and easily runs out of memory, and PyCharm's indexing time grows with use:
1) The amount of program code has not surged, yet PyCharm occupies a large amount of memory before the program even runs.
2) Index-related operations such as right-clicking to view a function or jumping to a definition are very slow.
3) Example programs downloaded from the web are small code files, but the download path for the image data was set to the project path.
4) No similar problem could be found when searching.

Resolution process: since the amount of code had not surged and only the image data kept growing, the image data was moved to a folder outside the project, and all data paths in the programs were changed to that external folder. After the change, memory usage dropped from over 5 GB (which still was not enough) to about 600 MB, and indexing time fell from over 2 hours (still unfinished) to a dozen or so seconds. Everything returned to normal.

Root cause analysis:
1) To make it convenient for the code to reference the data with relative paths, the data was placed directly inside the project directory. At the beginning of the deep-learning work the datasets were small, but as the work deepened the data grew larger and larger; with a DCGAN, for example, the raw data was 200 MB, but the training images generated along the way reached 12 GB.
2) By default, PyCharm walks over all data in the project; memory grows during indexing, and indexing time scales with the project size.

Final solution: move everything except the code out of the project, so that PyCharm does not index the image data, which is time-consuming and useless.

Moving the data out works, but it makes copying or migrating the code more cumbersome. There is also a commonly overlooked piece of advice: mark the folders you do not want indexed as Excluded Folders, so that they are not indexed.


Optimizing Joins in a Map-Reduce EnvironmentFoto AfratiNational Technical University of Athens,Greece afrati@softlab.ece.ntua.grJeffrey D.UllmanStanford University,USA ullman@ABSTRACTImplementations of map-reduce are being used to perform many operations on very large data.We explore alternative ways that a system could use the environment and capa-bilities of map-reduce implementations such as Hadoop.In particular,we look at strategies for combining the natural join of several relations.The general strategy we employ is to identify certain attributes of the multiway join that are part of the”map-key,”an identifier for a particular Reduce process to which the Map processes send tuples.Each at-tribute of the map-key gets a”share,”which is the number of buckets into which its values are hashed,to form a com-ponent of the identifier of a Reduce process.Relations have their tuples replicated in limited fashion,the degree of repli-cation depending on the shares for those map-key attributes that are missing from their schema.We study the problem of optimizing the shares,given afixed product(i.e.,afixed number of Reduce processes).An algorithm for detecting andfixing problems where a variable is mistakenly included in the map-key is given.Then,we consider two important special cases:chain joins and star joins.In each case we are able to determine the map-key and determine the shares that yield the least amount of replication.1.INTRODUCTION AND MOTIV ATION Search engines and other data-intensive applications have large amounts of data needing special-purpose computa-tions.The canonical problem today is the sparse-matrix-vector calculation involved with PageRank[2],where the dimension of the matrix and vector can be in the10’s of billions.Most of these computations are conceptually sim-ple,but their size has led implementors to distribute them across hundreds or thousands of low-end machines.This problem,and others like it,led to a new software stack to take the place offile systems,operating systems,and database-management systems.Central to this stack is afile system such as the Google File System(GFS)[8]or Hadoop Distributed File System Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment.To copy otherwise,or to republish,to post on servers or to redistribute to lists,requires a fee and/or special permission from the publisher,ACM.VLDB‘09,August24-28,2009,Lyon,FranceCopyright2009VLDB Endowment,ACM000-0-00000-000-0/00/00.(HDFS)[1].Suchfile systems are characterized by:•Block sizes that are perhaps1000times larger than those in conventionalfile systems—multimegabyte instead of multikilobyte.•Replication of blocks in relatively independent loca-tions(e.g.,on different racks)to increase availability.A powerful tool for building applications on such afile system is Google’s map-reduce[6]or its open-source equiva-lent Hadoop[1].Briefly,map-reduce allows a Map function to be applied to data stored in one or morefiles,resulting in key-value pairs.Many instantiations of the Map function can operate at once,and all their produced pairs are routed by a master controller to one or more Reduce processes,so that all pairs with the same key wind up at the same Re-duce process.The Reduce processes apply another function to combine the values 
associated with one key to produce a single result for that key.Map-reduce,inspired from functional programming,is a natural way to implement sparse-matrix-vector multiplica-tion in parallel,and we shall soon see an example of how it can be used to compute parallel joins.Further,map-reduce offers resilience to hardware failures,which can be expected to occur during a massive calculation.The master controller manages Map and Reduce processes and is able to redo them if a process fails.The new software stack includes higher-level,more data-base-like facilities,as well.Examples are Google’s BigTable [3],or Yahoo!’s PNUTS[5],which can be thought of ad-vancedfile-level facilities.At a still higher level,Yahoo!’s PIG/PigLatin[10]translates relational operations such as joins into map-reduce computations.The work in[4]sug-gests adding to map-reduce a“merge”phase and demon-strates how this can express relational algebra operators. 1.1A Model for Cluster ComputingThe same environment in which map-reduce proves so use-ful can also support interesting algorithms that do notfit the map-reduce form.Clustera[7]is an example of a system that allows moreflexible programming than does Hadoop, in the samefile environment.Although most of this pa-per is devoted to new algorithms that dofit the map-reduce framework,we shall suggest in Section1.4how one could take advantage of more general computation plans.Here are the elements that describe the environment in which computations like map-reduce can take place.1.Files:Afile is a set of tuples.It is stored in afilesystem such as GFS,that is,replicated and with avery large block size.Unusual assumptions aboutfiles are:(a)We assume the order of tuples in afile cannot bepredicted.Thus,thesefiles are really relations asin a relational DBMS.(b)Many processes can read afile in parallel.Thatassumption is justified by the fact that all blocksare replicated and so several copies can be readat once.(c)Many processes can write pieces of afile at thesame time.The justification is that tuples of thefile can appear in any order,so several processescan write into the same buffer,or into severalbuffers,and thence into thefile.2.Processes:A process is the conventional unit of com-putation.It may obtain input from one or morefiles and write output to one or morefiles.3.Processors:These are conventional nodes with a CPU,main memory,and secondary storage.We do not as-sume that the processors hold particularfiles or com-ponents offiles.There is an essentially infinite supply of processors.Any process can be assigned to any one processor.1.2The Cost Measure for AlgorithmsAn algorithm in our model is an acyclic graph of processes with an arc from process P1to process P2if P1generates output that is(part of)the input to P2.It is subject to the constraint that a process cannot begin until all of its input has been created.Note that we assume an infinite supply of processors,so any process can begin as soon as its input is ready.•The communication cost of a process is the size of the input to this process.Note that we do not count the output size for a process.The output must be input to at least one other process(and will be counted there), unless it is output of the algorithm as a whole.We cannot do anything about the size of the result of an algorithm anyway.But more importantly,the algo-rithms we deal with are query implementations.The output of a query that is much larger than its input is not likely to be useful.Even analytic queries,while they may involve joining large 
relations,usually end by aggregating the output so it is meaningful to the querier.•The total communication cost is the sum of the com-munication costs of all processes that constitute an algorithm.•The elapsed communication cost is defined on the acyc-lic graph of processes.Consider a path through this graph,and sum the communication costs of the pro-cesses along that path.The maximum sum,over all paths,is the elapsed communication cost.In our analysis,we do not account for the computation time taken by the processors.Typically,processing at a compute node can be done in main memory,if we are care-ful to assign limited amounts of work to each process.Thus,the cost of reading data from disk and shipping it over a net-work such as gigabit Ethernet will tend to dominate the total elapsed time.Even in situations such as we shall explore, where a process involves joining several relations,we shall assume that tricks such as semijoins and judicious ordering can bring the processing cost down so it is at most com-mensurate with the cost of shipping data to the processor. The technique of Jakobsson[?]for chain joins,involving early duplicate elimination,would also be very important for multiway joins such those that follow paths in the graph of the Web.1.3Outline of Paper and Our Contributions In this paper,we begin an investigation into optimization issues for algorithms implemented in the environment just described.In particular,we are interested in algorithms that minimize the total communication cost.Our contributions are the following:1.Section1.4is an example of a real problem—compu-tation of“hubs and authorities”—for which the ap-propriate algorithm does not have the standard map-reduce form.This example also serves to motivate the advantages of the multiway join,the study of which forms the largest portion of this paper.2.In Section2,we begin the study of multiway(natural)joins.For comparison,we review the“normal”way to compute(2-way)joins using map-reduce.Through ex-amples,we sketch an algorithm for multiway join eval-uation that optimizes the communication cost by se-lecting properly those attributes that are used to par-tition and replicate the data among Reduce processes;the selected attributes form the“map-key.”We also show that there are realistic situations in which the multiway join is more efficient than the conventional cascade of binary joins.3.In Section2.4we introduce the notion of a“share”for each attribute of the map-key.The product of the shares is afixed constant k,which is the number of Reduce processes we shall use to implement the join.It will turn out that each relation in a multiway join is replicated as many times as the product of the shares of the map-key attributes that are not in the schema for that relation.4.The heart of the paper explores how to choose themap-key and shares to minimize the communication cost.•The method of“Lagrangean multipliers”lets usset up the communication-cost-optimization prob-lem under the constraint that the product of theshare variables is a constant k.There is an im-plicit constraint on the share variables that eachmust be a positive integer.However,optimiza-tion techniques such as Lagrange’s do not supportsuch constraints directly.Rather,they serve onlyto identify points(values for all the share vari-ables)at which minima and maxima occur.Evenif we postpone the matter of rounding or other-wise adjusting the share variables to be positiveintegers,we must still consider both minima thatare identified by Lagrange’s method by having 
allderivatives with respect to each of the share vari-ables equal to0,and points lying on the boundaryof the region defined by requiring each share vari-able to be at least1.•In the common,simple case,we simply set upthe Lagrangean equations and solve them tofinda minimum in the positive orthant(region withall share variables nonnegative).If some of theshare variables are less than1,we can set themto1,their minimum possible value,and removethem from the map-key.We then resolve the op-timization problem for the smaller set of map-keyattributes.•Unfortunately,there are cases where the solu-tion to the Lagrangean equations implies that ata minimum,one or more share variables are0.What that actually means is that to attain a min-imum in the positive orthant under the constraintof afixed product of share variables,certain vari-ables must approach0,while other variables ap-proach infinity,in a way that the product of allthese variables remains afixed constant.Sec-tion3explores this problem.We begin in Sec-tion3.2by identifying“dominated”attributes,which can be shown never to belong in a map-key,and which explain most of the cases wherethe Lagrangean yields no solution within the pos-itive orthant.•But dominated attributes in the map-key are notresponsible for all such failures.Section3.4han-dles these rare but possible cases.We show that itis possible to remove attributes from the map-keyuntil the remaining attributes allow us to solvethe equations,although the process of removingattributes can be exponential in the number ofattributes.•Finally,in Section3.5we are able to put all of theabove ideas together.We offer an algorithm forfinding the optimal values of the share variablesfor any natural join.5.Section4examines two common kinds of joins:chainjoins and star joins(joins of a large fact table with sev-eral smaller dimension tables).For each of these types of joins we give closed-form solutions to the question of the optimal share of the map-key for each attribute.•In the case of star joins,the solution not only tellsus how to compute the join in a map-reduce-typeenvironment.It also suggests how one could opti-mize storage by partitioning the fact table perma-nently among all compute nodes and replicatingeach dimension table among a small subset of thecompute nodes.1.4A Motivating ExampleBefore proceeding,we shall take up a real problem and discuss how it might be implemented in a map-reduce-like environment.While much of this paper is devoted to algo-rithms that can be implemented in the map-reduce frame-work,the problem we discuss here can profit considerably from going outside map-reduce,while still exploiting the computation environment in which map-reduce operates.The problem we shall address is computing one step in the HITS iteration[9].In HITS,or“hubs and authorities,”one computes two scores—the hub score and the authority score—for each Web page.Intuitively,good hubs are pages that link to many good authorities,and good authorities are pages linked to by many good hubs.We shall concentrate on computing the authority score, from which the hub score can be computed easily(or vice-versa).We must do an iteration,where a vector that esti-mates the authority of each page is used,along with the inci-dence matrix of the Web,to compute a better estimate.The authority estimate will be represented by a relation A(P,S), where A(p,s)means that the estimated authority score of page p is s.The incidence matrix of the Web will be rep-resented by a relation M(X,Y),containing those pairs of pages x 
and y such that x has one or more links to y. To compute the next estimate of the authority of any page p, we:

1. Estimate the hub score of every page q to be the sum, over all pages r that q links to, of the current authority score of r.

2. Estimate the authority of page p to be the sum of the estimated hub score for every page q that links to p.

3. Normalize the authorities by finding the largest authority score m and dividing all authorities by m. This step is essential, or as we iterate, authority scores will grow beyond any bound. In this manner, we keep 1 as the maximum authority score, and the ratio of the authorities of any two pages is not affected by the normalization.

We can express these operations in SQL easily. The first two steps are implemented by the SQL query

SELECT m1.Y, SUM(A.S)
FROM A, M m1, M m2
WHERE A.P = m2.Y AND m1.X = m2.X
GROUP BY m1.Y

That is, we perform a 3-way join between A and two copies of M, do a bag projection onto two of the attributes, and then group and aggregate. Step (3) is implemented by finding the maximum second component in the relation produced by the query above, and then dividing all the second components by this value.

We could implement the 3-way join by two 2-way joins, each implemented by map-reduce, as in PIG. The maximum and division each could be done with another map-reduce. However, as we shall see, it is possible to do the 3-way join as a single map-reduce operation.¹ Further, parts of the projection, grouping, and aggregation can be bundled into the Reduce processes of this 3-way join, and the max-finding and division can be done simply by an extra set of processes that do not have the map-reduce form.

¹ Although we shall not prove it here, a consequence of the theory we develop is that the 3-way join is more efficient than two 2-way joins for small numbers of compute nodes. In particular, the 3-way join is preferable as long as the number of compute nodes does not exceed the ratio of the sizes of the relations M and A. This ratio is approximately the average fan-out of the Web, known to be approximately 15.

Figure 1 shows the algorithm we propose. The first column represents conventional Map processes for a 3-way join; we leave the discussion about how the 3-way join algorithm is implemented by map-reduce for Section 2. The second column represents the Reduce part of the 3-way join, but to save some communication, we can do (on the result of the Reduce process) a local projection, grouping and summing, so we do not have to transmit to the third column more than one authority score for any one page. We assume that pages are distributed among processes of the third column in such a way that all page-score pairs for a given page wind up at the same process. This distribution of data is exactly the same as what Hadoop does to distribute data from Map processes to Reduce processes.

[Figure: five columns of processes labeled "Map portion of 3-way join"; "Reduce portion of 3-way join, partial group and sum"; "Complete group/sum, partial max"; "Max"; and "Divide".]
Figure 1: The HITS algorithm as a network of processes

The third column completes the grouping and summing by combining information from different processes in the second column that pertain to the same page. This column also begins the max operation. Each process in that column identifies the largest authority score among any of the pages that are assigned to that process, and that information is passed to the single process in the fourth column. That process completes the computation of the maximum authority score and distributes that value to all processes of the fifth column.

The fifth column of processes has the simple job of taking some of the page-score pairs from the third column and normalizing the score by dividing by the value transmitted from the fourth column. Note we can arrange that the same compute node used for a process of the third column is also used for the same data in the fifth column. Thus, in practice no communication cost is incurred moving data from the third column to the fifth. The fact that our model counts the input sizes from both the third and fifth columns does not affect the order of magnitude of the cost, although it is important to choose execution nodes wisely to avoid unnecessary communication whenever we can.
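To make the three numbered steps concrete, here is a minimal, single-machine sketch of one authority-update iteration in Python; it mirrors the SQL query and the normalization step, not the distributed network of Figure 1, and the dict-and-list representations of A and M are purely illustrative.

from collections import defaultdict

def authority_step(A, M):
    """One HITS authority update. A: dict page -> score; M: list of (x, y) links."""
    # Step (1): hub score of q = sum of current authorities of pages q links to.
    hub = defaultdict(float)
    for x, y in M:                  # plays the role of m2 in the SQL query
        hub[x] += A.get(y, 0.0)
    # Step (2): authority of p = sum of hub scores of pages that link to p.
    auth = defaultdict(float)
    for x, y in M:                  # plays the role of m1 in the SQL query
        auth[y] += hub[x]
    # Step (3): normalize so the largest authority score is exactly 1.
    m = max(auth.values())
    return {p: score / m for p, score in auth.items()}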
2. MULTIWAY JOINS

There is a straightforward way to join relations using map-reduce. We begin with a discussion of this algorithm. We then consider a different way to join several relations in one map-reduce operation.

2.1 The Two-Way Join and Map-Reduce

Suppose relations R(A,B) and S(B,C) are each stored in a file of the type described in Section 1.1. To join these relations, we must associate each tuple from either relation with the key that is the value of its B-component. A collection of Map processes will turn each tuple (a,b) from R into a key-value pair with key b and value (a,R). Note that we include the relation with the value, so we can, in the Reduce phase, match only tuples from R with tuples from S, and not a pair of tuples from R or a pair of tuples from S. Similarly, we use a collection of Map processes to turn each tuple (b,c) from S into a key-value pair with key b and value (c,S).

The role of the Reduce processes is to combine tuples from R and S that have a common B-value. Typically, we shall need many Reduce processes, so we need to arrange that all tuples with a fixed B-value are sent to the same Reduce process. Suppose we use k Reduce processes. Then choose a hash function h that maps B-values into k buckets, each hash value corresponding to one of the Reduce processes. The output of any Map process with key b is sent to the Reduce process for hash value h(b). The Reduce processes write the joined tuples (a,b,c) that they find to a single output file.
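The following sketch simulates the Map and Reduce phases of this two-way join sequentially in Python; the hash-partitioning into k buckets stands in for what a map-reduce implementation does for us, and all names are illustrative.

from collections import defaultdict

def two_way_join(R, S, k):
    """Join R(A,B) with S(B,C). R: list of (a, b); S: list of (b, c)."""
    # Map phase: key each tuple by its B-value and tag it with its relation,
    # so the Reduce phase never pairs two R-tuples or two S-tuples.
    buckets = defaultdict(list)
    for a, b in R:
        buckets[hash(b) % k].append((b, 'R', a))
    for b, c in S:
        buckets[hash(b) % k].append((b, 'S', c))
    # Reduce phase: within each bucket, join tuples sharing the same B-value.
    result = []
    for tuples in buckets.values():
        r_vals, s_vals = defaultdict(list), defaultdict(list)
        for b, rel, v in tuples:
            (r_vals if rel == 'R' else s_vals)[b].append(v)
        for b in r_vals:
            result.extend((a, b, c) for a in r_vals[b] for c in s_vals[b])
    return result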
2.2 Implementation Under Hadoop

If the above algorithm is implemented in Hadoop, then the partition of keys according to the hash function h can be done behind the scenes. That is, you tell Hadoop the value of k you desire, and it will create k Reduce processes and partition the keys among them using a hash function. Further, it passes the key-value pairs to a Reduce process with the keys in sorted order. Thus, it is possible to implement Reduce to take advantage of the fact that all tuples from R and S with a fixed value of B will appear consecutively on the input.

That feature is both good and bad. It allows a simpler implementation of Reduce, but the time spent by Hadoop in sorting the input to a Reduce process may be more than the time spent setting up the main-memory data structures that allow the Reduce processes to find all the tuples with a fixed value of B.

2.3 Joining Several Relations at Once

Let us consider joining three relations

R(A,B) ⋈ S(B,C) ⋈ T(C,D)

We could implement this join by a sequence of two 2-way joins, choosing either to join R and S first, and then join T with the result, or to join S and T first and then join with R. Both joins can be implemented by map-reduce as described in Section 2.1.

An alternative algorithm involves joining all three relations at once, in a single map-reduce process. The Map processes send each tuple of R and T to many different Reduce processes, although each tuple of S is sent to only one Reduce process. The duplication of data increases the communication cost above the theoretical minimum, but in compensation, we do not have to communicate the result of the first join. Much of this paper is devoted to optimizing the way this algorithm is implemented, but as an introduction, suppose we use k = m² Reduce processes for some m. Values of B and C will each be hashed to m buckets, and each Reduce process will be associated with a pair of buckets, one for B and one for C.

Let h be a hash function with range 1, 2, ..., m, and associate each Reduce process with a pair (i,j), where integers i and j are each between 1 and m. Each tuple S(b,c) is sent to the Reduce process numbered ⟨h(b), h(c)⟩. Each tuple R(a,b) is sent to all Reduce processes numbered ⟨h(b), x⟩, for any x. Each tuple T(c,d) is sent to all Reduce processes numbered ⟨y, h(c)⟩, for any y. Thus, each process (i,j) gets 1/m²th of S, and 1/mth of R and T. An example, with m = 4, is shown in Fig. 2.

[Figure: a 4 × 4 grid of Reduce processes; the example tuple of S shown has h(S.b) = 1 and h(S.c) = 3.]
Figure 2: Distributing tuples of R, S, and T among k = m² processes

Each Reduce process computes the join of the tuples it receives. It is easy to observe that if there are three tuples R(a,b), S(b,c), and T(c,d) that join, then they will all be sent to the Reduce process numbered ⟨h(b), h(c)⟩. Thus, the algorithm computes the join correctly.
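The routing just described is easy to state in code. The sketch below (with 0-based bucket numbers, where the text uses 1 through m) shows how the Map phase would distribute tuples of R, S, and T among the k = m² Reduce processes; the names are illustrative, and each reducer would then join the tuples it receives locally.

def route_tuples(R, S, T, m):
    """Distribute R(A,B), S(B,C), T(C,D) among m*m reducers indexed (i, j)."""
    h = lambda v: hash(v) % m
    reducers = {(i, j): [] for i in range(m) for j in range(m)}
    for a, b in R:                      # R(a,b) goes to (h(b), x) for every x
        for x in range(m):
            reducers[h(b), x].append(('R', a, b))
    for b, c in S:                      # S(b,c) goes to exactly one reducer
        reducers[h(b), h(c)].append(('S', b, c))
    for c, d in T:                      # T(c,d) goes to (y, h(c)) for every y
        for y in range(m):
            reducers[y, h(c)].append(('T', c, d))
    return reducers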
2.4 An Introductory Optimization Example and the Notion of Share

In Section 2.3, we arbitrarily picked attributes B and C to form the map-key, and we chose to give B and C the same number of buckets, m = √k. This choice raises two questions:

1. Why are only B and C part of the map-key?

2. Is it best to give them the same number of buckets?

To learn how to optimize map-keys for a multiway join, let us begin with a simple example: the cyclic join

R(A,B) ⋈ S(B,C) ⋈ T(A,C)

Suppose that the target number of map-keys is k. That is, we shall use k Reduce processes to join tuples from the three relations. Each of the three attributes A, B, and C will have a share of the key, which we denote a, b, and c, respectively. We assume there are hash functions that map values of attribute A to a different buckets, values of B to b buckets, and values of C to c buckets. We use h as the hash function name, regardless of which attribute's value is being hashed. Note that abc = k.

• Convention: Throughout the paper, we use upper-case letters near the beginning of the alphabet for attributes and the corresponding lower-case letter as its share of a map-key. We refer to these variables a, b, ... as share variables.

Consider tuples (x,y) in relation R. Which Reduce processes need to know about this tuple? Recall that each Reduce process is associated with a map-key (u,v,w), where u is a hash value from 1 to a representing a bucket into which A-values are hashed. Similarly, v is a bucket in the range 1 to b representing a B-value, and w is a bucket in the range 1 to c representing a C-value. Tuple (x,y) from R can only be useful to this reducer if h(x) = u and h(y) = v. However, it could be useful to any reducer that has these first two key components, regardless of the value of w. We conclude that (x,y) must be replicated and sent to the c different reducers corresponding to key values ⟨h(x), h(y), w⟩, where 1 ≤ w ≤ c.

Similar reasoning tells us that any tuple (y,z) from S must be sent to the a different reducers corresponding to map-keys ⟨u, h(y), h(z)⟩, for 1 ≤ u ≤ a. Finally, a tuple (x,z) from T is sent to the b different reducers corresponding to map-keys ⟨h(x), v, h(z)⟩, for 1 ≤ v ≤ b.

This replication of tuples has a communication cost associated with it. The number of tuples passed from the Map processes to the Reduce processes is

rc + sa + tb

where r, s, and t are the numbers of tuples in relations R, S, and T, respectively.

• Convention: We shall, in what follows, use R, S, ... as relation names and use the corresponding lower-case letter as the size of the relation.

We must minimize the expression rc + sa + tb subject to the constraint that abc = k. There is another constraint that we shall not deal with immediately, but which eventually must be faced: each of a, b, and c must be a positive integer. To start, the method of Lagrangean multipliers serves us well. That is, we start with the expression

rc + sa + tb − λ(abc − k)

take derivatives with respect to the three variables, a, b, and c, and set the resulting expressions equal to 0. The result is three equations:

s = λbc    t = λac    r = λab

These come from the derivatives with respect to a, b, and c, in that order. If we multiply each equation by the variable missing from the right side (which is also the variable with respect to which we took the derivative to obtain that equation), and remember that abc equals the constant k, we get:

sa = λk    tb = λk    rc = λk

We shall refer to equations derived this way as the Lagrangean equations. If we multiply the left sides of the three equations and set that equal to the product of the right sides, we get rst·k = λ³k³ (remembering that abc on the left equals k). We can now solve for λ = ∛(rst/k²). From this, the first equation sa = λk yields a = ∛(krt/s²). Similarly, the next two equations yield b = ∛(krs/t²) and c = ∛(kst/r²). When we substitute these values in the original expression to be optimized, rc + sa + tb, we get the minimum amount of communication between Map and Reduce processes: 3∛(krst).

Note that the values of a, b, and c are not necessarily integers. However, the values derived tell us approximately which integers the share variables need to be. They also tell us the desired ratios of the share variables; for example, a/b = t/s. In fact, the share variable for each attribute is inversely proportional to the size of the relation from whose schema the attribute is missing. This rule makes sense, as it says we should equalize the cost of distributing each of the relations to the Reduce processes. These ratios also let us pick good integer approximations to a, b, and c, as well as a value of k that is in the approximate range we want and is the product abc.
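These formulas translate directly into a small helper; the sketch below computes the real-valued optimal shares and the resulting Map-to-Reduce communication for given relation sizes and reducer budget (the function name is ours, not the paper's).

def optimal_shares(r, s, t, k):
    """Optimal shares for the cyclic join R(A,B), S(B,C), T(A,C)."""
    a = (k * r * t / s**2) ** (1 / 3)
    b = (k * r * s / t**2) ** (1 / 3)
    c = (k * s * t / r**2) ** (1 / 3)
    cost = r * c + s * a + t * b    # equals 3 * (k*r*s*t) ** (1/3)
    return a, b, c, cost            # a * b * c == k, up to floating-point error

As noted above, one would then round a, b, and c to nearby integers whose product is an acceptable k.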
2.5 Comparison With Cascade of Joins

Under what circumstances is this 3-way join implemented by map-reduce a better choice than a cascade of two 2-way joins, each implemented by map-reduce? As usual, we shall not count the cost of producing the final result, since this result, if it is large, will likely be input to another operator, such as aggregation, that reduces the size of the output.

To simplify the calculation, we shall assume that all three relations have the same size r. For example, they might each be the incidence matrix of the Web, and the cyclic query is asking for cycles of length 3 in the Web (this may be useful, for example, in helping us identify certain kinds of spam farms). If r = s = t, the communication between the Map and Reduce processes simplifies to 3r∛k. We shall also assume that the probability of two tuples from different relations agreeing on their common attribute is p. For example, if the relations are incidence matrices of the Web, then rp equals the average out-degree of pages, which might be in the 10–15 range.

The communication of the optimal 3-way join is:

1. 3r for input to the Map processes.

2. 3r∛k for the input to the Reduce processes.

The second term dominates, so the total communication cost for the 3-way join is O(r∛k).

For the cascade of 2-way joins, whichever two we join first, we get an input size for the first Map processes of 2r. This figure is also the input to the first Reduce processes, and the output size for the Reduce processes is r²p. Thus, the second join's Map processes have an input size of r²p for the intermediate join and r for the third relation. This figure is also the input size for the Reduce processes associated with the second join, and we do not count the size of the output from those processes. Assuming rp > 1, the r²p term dominates, and the cascade of 2-way joins has total communication cost O(r²p).

We must thus compare r²p with the cost of the 3-way join, which we found to be O(r∛k). That is, the 3-way join will be better as long as ∛k is less than rp. Since r and p are properties of the data, while k is a parameter of the join algorithm that we may choose, the conclusion of this analysis is that there is a limit on how large k can be in order for the 3-way join to be the method of choice. This limit is k < (rp)³. For example, if rp = 15, as might be the case for the Web incidence matrix, then we can pick k up to 3375, and use that number of Reduce processes.

Example 2.1. Suppose r = 10⁷, p = 10⁻⁵, and k = 1000. Then the cost of the cascade of 2-way joins is r²p = 10⁹. The cost of the 3-way join is r∛k = 10⁸, which is much less. Note also that the output size is small compared with both. Because there are three attributes that have to match to make a tuple in R(A,B) ⋈ S(B,C) ⋈ T(A,C), the output size is r³p³ = 10⁶.

2.6 Trade-Off Between Speed and Cost

Before moving on to the general problem of optimizing multiway joins, let us observe that the example of Section 2.4 illustrates the trade-off that we face when using a method that replicates input. We saw that the total communication cost was O(∛(krst)). What is the elapsed communication cost? First, there is no limit on the number of Map processes we can use, as long as each process gets at least one chunk of input. Thus, we can ignore the elapsed cost of the Map processes and concentrate on the k Reduce processes. Since the hash function used will divide the tuples of the relations randomly, we do not expect there to be much skew, except in some extreme cases. Thus, we can estimate the elapsed communication cost as 1/kth of the total communication cost, or O(∛(rst/k²)).

Thus, while the total cost grows as k^(1/3), the elapsed cost shrinks as k^(2/3). That is, the faster we want the join computed, the more resources we consume.
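This scaling is easy to check numerically; a small sketch, taking r = s = t = 10⁷ and assuming no skew:

r = s = t = 1e7
for k in (8, 64, 512):
    total = 3 * (k * r * s * t) ** (1 / 3)
    elapsed = total / k             # 1/k-th of the total
    print(k, total, elapsed)        # each 8-fold increase in k doubles the
                                    # total cost and quarters the elapsed cost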
3. OPTIMIZATION OF MULTIWAY JOINS

Now, let us see how the example of Section 2.4 generalizes to arbitrary natural joins. We shall again start out with an example that illustrates why certain attributes should not be allowed to have a share of the map-key. We then look at more complex situations where the equations we derive from the Lagrangean method do not have a feasible solution, and we show how it is possible to resolve those problems by eliminating attributes from the map-key.

3.1 A Preliminary Algorithm for Optimizing Share Variables

Here is an algorithm that generalizes the technique of Section 2.4. As we shall see, it sometimes yields a solution and sometimes not. Most of the rest of this section is devoted to fixing up the cases where it does not.

Suppose that we want to compute the natural join of relations R₁, R₂, ..., Rₙ, and the attributes appearing among the relation schemas are A₁, A₂, ..., Aₘ.

Step 1: Start with the cost expression

τ₁ + τ₂ + ··· + τₙ − λ(a₁a₂···aₘ − k)

where τᵢ is the term that represents the cost of communicating tuples of relation Rᵢ to the Reduce processes that need them.
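Step 1 and the solution of the resulting equations can be prototyped with a computer-algebra system. The sketch below, which assumes the sympy library, reproduces the derivation of Section 2.4 for the cyclic join; for a general natural join one would generate one τ term per relation in the same way.

import sympy as sp

# Share variables, the multiplier, and symbolic sizes / reducer budget.
a, b, c, lam = sp.symbols('a b c lam', positive=True)
r, s, t, k = sp.symbols('r s t k', positive=True)

cost = r*c + s*a + t*b              # tau_1 + tau_2 + tau_3 for the cyclic join
L = cost - lam*(a*b*c - k)          # the cost expression of Step 1

# Set the derivatives to zero and impose the constraint abc = k.
eqs = [sp.diff(L, v) for v in (a, b, c)] + [a*b*c - k]
sol = sp.solve(eqs, (a, b, c, lam), dict=True)[0]
print(sol[a])                       # expected: (k*r*t/s**2)**(1/3)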
